Entropy predicts sensitivity of pseudorandom seeds
Stockholm University, Faculty of Science, Department of Mathematics. ORCID iD: 0000-0001-8442-0536
Stockholm University, Faculty of Science, Department of Mathematics. ORCID iD: 0000-0001-7378-2320
2023 (English). In: Genome Research, ISSN 1088-9051, E-ISSN 1549-5469, Vol. 33, no 7, p. 1162-1174. Article in journal (Refereed). Published.
Abstract [en]

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness–sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity to other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.

Place, publisher, year, edition, pages
2023. Vol. 33, no 7, p. 1162-1174
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:su:diva-225341
DOI: 10.1101/gr.277645.123
PubMedID: 37217253
Scopus ID: 2-s2.0-85168804709
OAI: oai:DiVA.org:su-225341
DiVA, id: diva2:1827718
Funder
Swedish Research Council, 2018-05973
Swedish Research Council
Swedish Research Council, 2021-04000

Available from: 2024-01-15. Created: 2024-01-15. Last updated: 2024-02-09. Bibliographically approved.

Authority records

Maier, Benjamin Dominik; Sahlin, Kristoffer
