Publications (10 of 15)
Karami, M., Mohammadi, A. S., Martin, M., Ekim, B., Shen, W., Guo, L., . . . Sahlin, K. (2024). Designing efficient randstrobes for sequence similarity analyses. Bioinformatics, 40(4), Article ID btae187.
2024 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 40, no 4, article id btae187. Article in journal (Refereed). Published.
Abstract [en]

Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy.

Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign’s accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

Availability and implementation: All methods and evaluation benchmarks are available in a public Github repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

National Category
Bioinformatics (Computational Biology) Construction Management
Identifiers
urn:nbn:se:su:diva-229044 (URN); 10.1093/bioinformatics/btae187 (DOI); 001206629000004 (); 38579261 (PubMedID); 2-s2.0-85191199242 (Scopus ID)
Available from: 2024-05-20. Created: 2024-05-20. Last updated: 2024-05-20. Bibliographically approved.
Sahlin, K., Baudeau, T., Cazaux, B. & Marchet, C. (2023). A survey of mapping algorithms in the long-reads era. Genome Biology, 24(1), Article ID 133.
2023 (English). In: Genome Biology, ISSN 1465-6906, E-ISSN 1474-760X, Vol. 24, no 1, article id 133. Article, review/survey (Refereed). Published.
Abstract [en]

It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).

National Category
Environmental Biotechnology Biological Sciences
Identifiers
urn:nbn:se:su:diva-218376 (URN); 10.1186/s13059-023-02972-3 (DOI); 001000395100002 (); 37264447 (PubMedID); 2-s2.0-85160969471 (Scopus ID)
Available from: 2023-06-27. Created: 2023-06-27. Last updated: 2023-06-27. Bibliographically approved.
Ekim, B., Sahlin, K., Medvedev, P., Berger, B. & Chikhi, R. (2023). Efficient mapping of accurate long reads in minimizer space with mapquik. Genome Research, 33(7), 1188-1197
2023 (English). In: Genome Research, ISSN 1088-9051, E-ISSN 1549-5469, Vol. 33, no 7, p. 1188-1197. Article in journal (Refereed). Published.
Abstract [en]

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps, which are fundamental bottlenecks to read mapping, for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37x speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a 410x speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudochaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.

National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:su:diva-221725 (URN); 10.1101/gr.277679.123 (DOI); 001059942600001 (); 37399256 (PubMedID); 2-s2.0-85167896090 (Scopus ID)
Available from: 2023-09-28. Created: 2023-09-28. Last updated: 2023-09-28. Bibliographically approved.
Maier, B. D. & Sahlin, K. (2023). Entropy predicts sensitivity of pseudorandom seeds. Genome Research, 33(7), 1162-1174
2023 (English). In: Genome Research, ISSN 1088-9051, E-ISSN 1549-5469, Vol. 33, no 7, p. 1162-1174. Article in journal (Refereed). Published.
Abstract [en]

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness–sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity compared with other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.

National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-225341 (URN); 10.1101/gr.277645.123 (DOI); 37217253 (PubMedID); 2-s2.0-85168804709 (Scopus ID)
Funder
Swedish Research Council, 2018-05973; Swedish Research Council; Swedish Research Council, 2021-04000
Available from: 2024-01-15. Created: 2024-01-15. Last updated: 2024-02-09. Bibliographically approved.
Petri, A. J. & Sahlin, K. (2023). isONform: reference-free transcriptome reconstruction from Oxford Nanopore data. Bioinformatics, 39, i222-i231
2023 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 39, p. i222-i231. Article in journal (Refereed). Published.
Abstract [en]

Motivation: With advances in long-read transcriptome sequencing, we can now fully sequence transcripts, which greatly improves our ability to study transcription processes. A popular long-read transcriptome sequencing technique is Oxford Nanopore Technologies (ONT), which through its cost-effective sequencing and high throughput, has the potential to characterize the transcriptome in a cell. However, due to transcript variability and sequencing errors, long cDNA reads need substantial bioinformatic processing to produce a set of isoform predictions from the reads. Several genome and annotation-based methods exist to produce transcript predictions. However, such methods require high-quality genomes and annotations and are limited by the accuracy of long-read splice aligners. In addition, gene families with high heterogeneity may not be well represented by a reference genome and would benefit from reference-free analysis. Reference-free methods to predict transcripts from ONT, such as RATTLE, exist, but their sensitivity is not comparable to reference-based approaches.

Results: We present isONform, a high-sensitivity algorithm to construct isoforms from ONT cDNA sequencing data. The algorithm is based on iterative bubble popping on gene graphs built from fuzzy seeds from the reads. Using simulated, synthetic, and biological ONT cDNA data, we show that isONform has substantially higher sensitivity than RATTLE albeit with some loss in precision. On biological data, we show that isONform's predictions have substantially higher consistency with the annotation-based method StringTie2 compared with RATTLE. We believe isONform can be used both for isoform construction for organisms without well-annotated genomes and as an orthogonal method to verify predictions of reference-based methods.

National Category
Biological Sciences Environmental Biotechnology Computer and Information Sciences Mathematics
Identifiers
urn:nbn:se:su:diva-220840 (URN); 10.1093/bioinformatics/btad264 (DOI); 001027457000029 (); 37387174 (PubMedID); 2-s2.0-85163651809 (Scopus ID)
Available from: 2023-09-14. Created: 2023-09-14. Last updated: 2024-03-26. Bibliographically approved.
Namias, A., Sahlin, K., Makoundou, P., Bonnici, I., Sicard, M., Belkhir, K. & Weill, M. (2023). Nanopore sequencing of PCR products enables multicopy gene family reconstruction. Computational and Structural Biotechnology Journal, 21, 3656-3664
2023 (English). In: Computational and Structural Biotechnology Journal, E-ISSN 2001-0370, Vol. 21, p. 3656-3664. Article in journal (Refereed). Published.
Abstract [en]

The importance of gene amplifications in evolution is increasingly recognized. Yet, tools to study multi-copy gene families are still scarce, and many such families are overlooked using common sequencing methods. Haplotype reconstruction is even harder for polymorphic multi-copy gene families. Here, we show that all variants (or haplotypes) of a multi-copy gene family present in a single genome can be obtained using Oxford Nanopore Technologies sequencing of PCR products, followed by steps of mapping, SNP calling and haplotyping. As a proof of concept, we acquired the sequences of highly similar variants of the cidA and cidB genes present in the genome of the Wolbachia wPip, a bacterium infecting Culex pipiens mosquitoes. Our method relies on a wide database of cid genes, previously acquired by cloning and Sanger sequencing. We addressed problems commonly faced when using mapping approaches for multi-copy gene families with highly similar variants. In addition, we confirmed that PCR amplification causes frequent chimeras which have to be carefully considered when working on families of recombinant genes. We tested the robustness of the method using a combination of bioinformatics (read simulations) and molecular biology approaches (sequence acquisitions through cloning and Sanger sequencing, specific PCRs and digital droplet PCR). When different haplotypes present within a single genome cannot be reconstructed from short-read sequencing, this pipeline provides high-throughput acquisition and gives reliable results as well as insights into the relative copy numbers of the different variants.

Keywords
Multi-copy genes, Nanopore sequencing, PCR recombination, Wolbachia
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:su:diva-225342 (URN); 10.1016/j.csbj.2023.07.012 (DOI); 001046093700001 (); 2-s2.0-85165127766 (Scopus ID)
Funder
Swedish Research Council, 2021-04000
Available from: 2024-01-15. Created: 2024-01-15. Last updated: 2024-02-12. Bibliographically approved.
Tomaszkiewicz, M., Sahlin, K., Medvedev, P. & Makova, K. D. (2023). Transcript Isoform Diversity of Ampliconic Genes on the Y Chromosome of Great Apes. Genome Biology and Evolution, 15(11), Article ID evad205.
2023 (English). In: Genome Biology and Evolution, E-ISSN 1759-6653, Vol. 15, no 11, article id evad205. Article in journal (Refereed). Published.
Abstract [en]

Y chromosomal ampliconic genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has been studied in great apes; however, the diversity of splicing variants remains unexplored. Here, we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan). To achieve this, we enriched YAG transcripts with capture probe hybridization and sequenced them with long (Pacific Biosciences) reads. Our analysis of this data set resulted in several findings. First, we observed evolutionarily conserved alternative splicing patterns for most YAG families except for BPY2 and PRY. Second, our results suggest that BPY2 transcripts and proteins originate from separate genomic regions in bonobo versus human, which is possibly facilitated by acquiring new promoters. Third, our analysis indicates that the PRY gene family, having the highest representation of noncoding transcripts, has been undergoing pseudogenization. Fourth, we have not detected signatures of selection in the five YAG families shared among great apes, even though we identified many species-specific protein-coding transcripts. Fifth, we predicted consensus disorder regions across most gene families and species, which could be used for future investigations of male infertility. Overall, our work illuminates the YAG isoform landscape and provides a genomic resource for future functional studies focusing on infertility phenotypes in humans and critically endangered great apes.

Keywords
transcript isoform, diversity, ampliconic gene, Y chromosome, great apes
National Category
Microbiology Zoology
Identifiers
urn:nbn:se:su:diva-225091 (URN); 10.1093/gbe/evad205 (DOI); 001109466800004 (); 37967251 (PubMedID); 2-s2.0-85178499138 (Scopus ID)
Available from: 2024-01-08. Created: 2024-01-08. Last updated: 2024-07-04. Bibliographically approved.
Pomerantz, A., Sahlin, K., Vasiljevic, N., Seah, A., Lim, M., Humble, E., . . . Prost, S. (2022). Rapid in situ identification of biological specimens via DNA amplicon sequencing using miniaturized laboratory equipment. Nature Protocols, 17(6), 1415-1443
2022 (English). In: Nature Protocols, ISSN 1754-2189, E-ISSN 1750-2799, Vol. 17, no 6, p. 1415-1443. Article in journal (Refereed). Published.
Abstract [en]

In many parts of the world, human-mediated environmental change is depleting biodiversity faster than it can be characterized, while invasive species cause agricultural damage, threaten human health and disrupt native habitats. Consequently, the application of effective approaches for rapid surveillance and identification of biological specimens is increasingly important to inform conservation and biosurveillance efforts. Taxonomic assignments have been greatly advanced using sequence-based applications, such as DNA barcoding, a diagnostic technique that utilizes PCR and DNA sequence analysis of standardized genetic regions. However, in many biodiversity hotspots, endeavors are often hindered by a lack of laboratory infrastructure, funding for biodiversity research and restrictions on the transport of biological samples. A promising development is the advent of low-cost, miniaturized scientific equipment. Such tools can be assembled into functional laboratories to carry out genetic analyses in situ, at local institutions, field stations or classrooms. Here, we outline the steps required to perform amplicon sequencing applications, from DNA isolation to nanopore sequencing and downstream data analysis, all of which can be conducted outside of a conventional laboratory environment using miniaturized scientific equipment, without reliance on Internet connectivity. Depending on sample type, the protocol (from DNA extraction to full bioinformatic analyses) can be completed within 10 h, and with appropriate quality controls can be used for diagnostic identification of samples independent of core genomic facilities that are required for alternative methods. 

National Category
Biological Sciences
Identifiers
urn:nbn:se:su:diva-204500 (URN); 10.1038/s41596-022-00682-x (DOI); 000780958400002 (); 35411044 (PubMedID); 2-s2.0-85129143112 (Scopus ID)
Available from: 2022-05-09. Created: 2022-05-09. Last updated: 2022-06-09. Bibliographically approved.
Cáceres, M., Mumey, B., Husić, E., Rizzi, R., Cairo, M., Sahlin, K. & Tomescu, A. I. (2022). Safety in Multi-Assembly via Paths Appearing in All Path Covers of a DAG. IEEE/ACM Transactions on Computational Biology & Bioinformatics, 19(6), 3673-3684
2022 (English). In: IEEE/ACM Transactions on Computational Biology & Bioinformatics, ISSN 1545-5963, E-ISSN 1557-9964, Vol. 19, no 6, p. 3673-3684. Article in journal (Refereed). Published.
Abstract [en]

A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths) that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporates practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length compared with unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.

Keywords
Graph algorithms, Network problems, Analysis of Algorithms and Problem Complexity, Biology and genetics
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:su:diva-199751 (URN); 10.1109/tcbb.2021.3131203 (DOI); 000966719600060 (); 34847041 (PubMedID); 2-s2.0-85120913195 (Scopus ID)
Funder
Academy of Finland
Available from: 2021-12-14. Created: 2021-12-14. Last updated: 2024-06-10. Bibliographically approved.
Sahlin, K. (2022). Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biology, 23, Article ID 260.
2022 (English). In: Genome Biology, ISSN 1465-6906, E-ISSN 1474-760X, Vol. 23, article id 260. Article in journal (Refereed). Published.
Abstract [en]

Read alignment is often the computational bottleneck in analyses. Recently, several advances have been made on seeding methods for fast sequence comparison. We combine two such methods, syncmers and strobemers, in a novel seeding approach for constructing dynamic-sized fuzzy seeds and implement the method in a short-read aligner, strobealign. The seeding is fast to construct and effectively reduces repetitiveness in the seeding step, as shown using a novel metric E-hits. strobealign is several times faster than traditional aligners at similar and sometimes higher accuracy while being both faster and more accurate than more recently proposed aligners for short reads of lengths 150nt and longer. Availability: https://github.com/ksahlin/strobealign

Keywords
Read alignment, Short-reads, Read mapping, Strobemers, Syncmers, Seed-and-extend
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-214118 (URN); 10.1186/s13059-022-02831-7 (DOI); 000899620000002 (); 36522758 (PubMedID); 2-s2.0-85144105617 (Scopus ID)
Funder
The Royal Swedish Academy of Sciences, 2021-04000_VR; Stockholm University
Available from: 2023-01-23. Created: 2023-01-23. Last updated: 2024-05-29. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0001-7378-2320