Efficient mapping of accurate long reads in minimizer space with mapquik
Stockholm University, Faculty of Science, Department of Mathematics. Stockholm University, Science for Life Laboratory (SciLifeLab). ORCID iD: 0000-0001-7378-2320
2023 (English). In: Genome Research, ISSN 1088-9051, E-ISSN 1549-5469, Vol. 33, no 7, p. 1188-1197. Article in journal (Refereed). Published.
Abstract [en]

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps, which are fundamental bottlenecks to read mapping, for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37× speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a 410× speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudochaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
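The seeding idea in the abstract (match runs of k consecutive minimizers, and index only the k-min-mers that occur exactly once in the reference) can be illustrated with a small Python sketch. This is not mapquik's code. The function names, the CRC32 hash, and the density-threshold rule for picking minimizers are all assumptions made for illustration, and the sketch ignores reverse-complement strands and the chaining step entirely.

```python
import zlib
from collections import defaultdict


def minimizers(seq, l, density):
    """Return (position, l-mer) pairs whose hash falls below a threshold.

    Assumption: an l-mer is sampled when its CRC32 hash is below
    density * 2**32, so selection depends only on the l-mer itself.
    """
    threshold = int(density * 2**32)
    selected = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if zlib.crc32(lmer.encode()) < threshold:
            selected.append((i, lmer))
    return selected


def k_min_mers(seq, k, l, density):
    """Yield (key, start, end) for every run of k consecutive minimizers."""
    mins = minimizers(seq, l, density)
    for j in range(len(mins) - k + 1):
        window = mins[j:j + k]
        key = tuple(lmer for _, lmer in window)
        start = window[0][0]
        end = window[-1][0] + l  # end of the last minimizer in the run
        yield key, start, end


def build_unique_index(reference, k, l, density):
    """Index only the k-min-mers that occur exactly once in the reference."""
    locations = defaultdict(list)
    for key, start, end in k_min_mers(reference, k, l, density):
        locations[key].append((start, end))
    return {key: locs[0] for key, locs in locations.items() if len(locs) == 1}


def seed_hits(read, index, k, l, density):
    """Return (read_start, reference_start) for each read k-min-mer in the index."""
    hits = []
    for key, read_start, _ in k_min_mers(read, k, l, density):
        if key in index:
            hits.append((read_start, index[key][0]))
    return hits


if __name__ == "__main__":
    reference = "ACGTTGCAAT"
    read = reference[2:8]  # an error-free read taken from offset 2
    index = build_unique_index(reference, k=2, l=3, density=1.0)
    print(seed_hits(read, index, k=2, l=3, density=1.0))
```

With `density=1.0` every l-mer is sampled, which keeps the toy example deterministic; a real mapper would sample far more sparsely. Because each indexed key has a single reference location, a lookup returns at most one candidate position per seed, which is the property the abstract credits for reducing extraneous matches.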

Place, publisher, year, edition, pages
2023. Vol. 33, no 7, p. 1188-1197
National Category
Bioinformatics and Systems Biology
Identifiers
URN: urn:nbn:se:su:diva-221725
DOI: 10.1101/gr.277679.123
ISI: 001059942600001
PubMedID: 37399256
Scopus ID: 2-s2.0-85167896090
OAI: oai:DiVA.org:su-221725
DiVA, id: diva2:1800912
Available from: 2023-09-28 Created: 2023-09-28 Last updated: 2023-09-28 Bibliographically approved
Authors
Ekim, Baris; Sahlin, Kristoffer