Publications (10 of 19)
Tolstoganov, I., Martin, M., Buchin, N. & Sahlin, K. (2026). Multi-context seeds enable fast and high-accuracy read mapping. Genome Biology, 27(1), Article ID 118.
2026 (English). In: Genome Biology, ISSN 1465-6906, E-ISSN 1474-760X, Vol. 27, no 1, article id 118. Article in journal (Refereed). Published
Abstract [en]

A key step in sequence similarity search is to identify shared seeds between a query and a reference sequence. A well-known tradeoff is that longer seeds offer fast searches but reduce sensitivity in variable regions. We introduce multi-context seeds (MCS), which allow the storage of seeds with different lengths in the same index structure, thus retaining the advantages of both short and long seeds. We demonstrate the applicability of MCS by implementing them in strobealign. Strobealign with MCS substantially improves accuracy compared to the previous version with little cost in runtime and no memory overhead.

Keywords
Illumina, K-mers, Read mapping, Seeds, Strobemers
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-254465 (URN); 10.1186/s13059-026-04017-x (DOI); 001734636100001 (ISI); 41764549 (PubMedID); 2-s2.0-105035323888 (Scopus ID)
Available from: 2026-04-22 Created: 2026-04-22 Last updated: 2026-04-22. Bibliographically approved
Petri, A. J., Thi-Huyen Nguyen, M., Rajwar, A., Benson, E. & Sahlin, K. (2025). cONcat: Computational reconstruction of concatenated fragments from long Oxford Nanopore reads.
2025 (English). Article in journal (Other academic). Submitted
Abstract [en]

Synthetic combinatorial DNA libraries are widely used to produce protein variants, to optimize binders, and for high-throughput studies of protein-DNA interactions. The libraries can be made by researchers or vendors, and high-throughput sequencing is used both for quality control and to study the outcome of selection experiments. Oxford Nanopore sequencing (ONT) is well suited to this as it allows for long read lengths and can be done rapidly with low-cost instrumentation. However, it suffers from a lower overall read accuracy and an uneven error profile. No current bioinformatics tools are well suited to the challenge of deducing the composition and order of constituent members of combinatorial libraries from ONT reads.

We introduce cONcat, an algorithm to identify the makeup of concatenated DNA fragments in a set of ONT sequencing reads from a pool of known fragments. cONcat uses an edit-distance-based recursive covering algorithm to find the best possible matchings between the fragments and the reads. In our experiments on simulated and experimental data, cONcat could accurately detect the correct fragment coverings given the short fragment sizes (< 20 bp) and the sequencing errors present in ONT reads. However, we find that the high error rates at the start of ONT reads make it challenging to get confident coverage there, implying a need for experimental strategies that avoid placing key sequence information at the start of reads.
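
As an illustration of the edit-distance building block mentioned in the abstract, the following minimal Python sketch computes the Levenshtein distance between a known fragment and a read segment. It is not cONcat's implementation: the recursive covering step is not shown, and the function name `edit_distance` and the example sequences are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions).

    Uses dynamic programming and keeps only two rows of the table.
    """
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical fragment and a read segment with one deleted base.
fragment = "ACGTTGCA"
read_segment = "ACGTGCA"
print(edit_distance(fragment, read_segment))  # prints 1
```

In a covering scheme of this kind, a fragment would plausibly be accepted at a read position only if its distance stays below a tolerance chosen relative to the fragment length; that threshold is likewise an assumption here.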

National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-243270 (URN); 10.1101/2025.03.05.641699 (DOI)
Available from: 2025-05-20 Created: 2025-05-20 Last updated: 2025-06-05
Petri, A. J. & Sahlin, K. (2025). De novo clustering of large long-read transcriptome datasets with isONclust3. Bioinformatics, 41(5), Article ID btaf207.
2025 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 41, no 5, article id btaf207. Article in journal (Refereed). Published
Abstract [en]

Motivation

Long-read sequencing techniques can sequence transcripts from end to end, greatly improving our ability to study the transcription process. Although there are several well-established tools for long-read transcriptome analysis, most are reference-based. This limits the analysis of organisms without high-quality reference genomes and of samples or genes with high variability (e.g. cancer samples or some gene families). In such settings, analysis using a reference-free method is favorable. The computational problem of clustering long reads by region of common origin is well-established for reference-free transcriptome analysis pipelines. Such clustering enables large datasets to be split roughly by gene family and, therefore, an independent analysis of each cluster. Tools exist for this, but none of them can efficiently process the large number of reads that are now generated by long-read sequencing technologies.

Results

We present isONclust3, an improved algorithm over isONclust and isONclust2, to cluster massive long-read transcriptome datasets into gene families. Like isONclust, isONclust3 represents each cluster with a set of minimizers. However, unlike other approaches, isONclust3 dynamically updates the cluster representation during clustering by adding high-confidence minimizers from new reads assigned to the cluster and employs an iterative cluster-merging step. We show that isONclust3 yields results with higher or comparable quality to state-of-the-art algorithms but is 10–100 times faster on large datasets. Also, using a 256 Gb computing node, isONclust3 was the only tool that could cluster 37 million PacBio reads, which is a typical throughput of the recent PacBio Revio sequencing machine.
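
To make the idea of a minimizer-based cluster representation concrete, here is a minimal Python sketch of window minimizers and of the fraction of a read's minimizers shared with a cluster. It is an illustration only, not isONclust3's code: the lexicographic ordering, the parameters `k` and `w`, and the helper names `minimizers` and `shared_fraction` are assumptions.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) for the smallest k-mer in each window
    of w consecutive k-mers (lexicographic order, leftmost on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        # Adjacent windows often select the same position; record it once.
        if not picked or picked[-1][0] != best:
            picked.append((best, kmers[best]))
    return picked


def shared_fraction(read_mins: set[str], cluster_mins: set[str]) -> float:
    """Fraction of the read's minimizers that are present in the cluster."""
    if not read_mins:
        return 0.0
    return len(read_mins & cluster_mins) / len(read_mins)


print(minimizers("ACGTTGCA", k=3, w=2))
# [(0, 'ACG'), (1, 'CGT'), (2, 'GTT'), (4, 'TGC'), (5, 'GCA')]
```

Under these assumptions, a read would be assigned to the cluster with the highest shared fraction, and the dynamic update described in the abstract would correspond to adding selected minimizers of the newly assigned read to that cluster's set.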

National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-243269 (URN); 10.1093/bioinformatics/btaf207 (DOI); 001483472300001 (ISI); 40265453 (PubMedID); 2-s2.0-105004673060 (Scopus ID)
Funder
Swedish Research Council, 2021–04000
Available from: 2025-05-20 Created: 2025-05-20 Last updated: 2025-06-02. Bibliographically approved
Karami, M., Mohammadi, A. S., Martin, M., Ekim, B., Shen, W., Guo, L., . . . Sahlin, K. (2024). Designing efficient randstrobes for sequence similarity analyses. Bioinformatics, 40(4), Article ID btae187.
2024 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 40, no 4, article id btae187. Article in journal (Refereed). Published
Abstract [en]

Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy.

Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign’s accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

Availability and implementation: All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

National Category
Bioinformatics (Computational Biology); Construction Management
Identifiers
urn:nbn:se:su:diva-229044 (URN); 10.1093/bioinformatics/btae187 (DOI); 001206629000004 (ISI); 38579261 (PubMedID); 2-s2.0-85191199242 (Scopus ID)
Available from: 2024-05-20 Created: 2024-05-20 Last updated: 2026-03-12. Bibliographically approved
Baudeau, T. & Sahlin, K. (2024). Improved sub-genomic RNA prediction with the ARTIC protocol. Nucleic Acids Research, 52(17), Article ID e82.
2024 (English). In: Nucleic Acids Research, ISSN 0305-1048, E-ISSN 1362-4962, Vol. 52, no 17, article id e82. Article in journal (Refereed). Published
Abstract [en]

Viral subgenomic RNA (sgRNA) plays a major role in SARS-CoV-2's replication, pathogenicity, and evolution. Recent sequencing protocols, such as the ARTIC protocol, have been established. However, due to virus-specific biological processes, analyzing sgRNA through virus-specific read sequencing data is a computational challenge. Current methods rely on computational tools designed for eukaryotic genomes, leaving a gap in tools designed specifically for sgRNA detection. To address this, we make two contributions. Firstly, we present sgENERATE, an evaluation pipeline to study the accuracy and efficacy of sgRNA detection tools using the popular ARTIC sequencing protocol. Using sgENERATE, we evaluate periscope, a recently introduced tool that detects sgRNA from ARTIC sequencing data. We find that periscope has biased predictions and high computational costs. Secondly, using the information produced from sgENERATE, we redesign the algorithm in periscope to use multiple references from canonical sgRNAs to mitigate alignment issues and improve sgRNA and non-canonical sgRNA detection. We evaluate periscope and our algorithm, periscope_multi, on simulated and biological sequencing datasets and demonstrate periscope_multi's enhanced sgRNA detection accuracy. Our contribution advances tools for studying viral sgRNA, paving the way for more accurate and efficient analyses in the context of viral RNA discovery.

National Category
Computational Mathematics
Identifiers
urn:nbn:se:su:diva-237713 (URN); 10.1093/nar/gkae687 (DOI); 001291399900001 (ISI); 39149898 (PubMedID); 2-s2.0-85204759367 (Scopus ID)
Available from: 2025-01-10 Created: 2025-01-10 Last updated: 2025-10-03. Bibliographically approved
Sahlin, K., Baudeau, T., Cazaux, B. & Marchet, C. (2023). A survey of mapping algorithms in the long-reads era. Genome Biology, 24(1), Article ID 133.
2023 (English). In: Genome Biology, ISSN 1465-6906, E-ISSN 1474-760X, Vol. 24, no 1, article id 133. Article, review/survey (Refereed). Published
Abstract [en]

It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).

National Category
Environmental Biotechnology; Biological Sciences
Identifiers
urn:nbn:se:su:diva-218376 (URN); 10.1186/s13059-023-02972-3 (DOI); 001000395100002 (ISI); 37264447 (PubMedID); 2-s2.0-85160969471 (Scopus ID)
Available from: 2023-06-27 Created: 2023-06-27 Last updated: 2023-06-27. Bibliographically approved
Ekim, B., Sahlin, K., Medvedev, P., Berger, B. & Chikhi, R. (2023). Efficient mapping of accurate long reads in minimizer space with mapquik. Genome Research, 33(7), 1188-1197
2023 (English). In: Genome Research, ISSN 1088-9051, E-ISSN 1549-5469, Vol. 33, no 7, p. 1188-1197. Article in journal (Refereed). Published
Abstract [en]

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps, which are fundamental bottlenecks to read mapping, for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37x speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a 410x speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudochaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.

National Category
Bioinformatics and Computational Biology
Identifiers
urn:nbn:se:su:diva-221725 (URN); 10.1101/gr.277679.123 (DOI); 001059942600001 (ISI); 37399256 (PubMedID); 2-s2.0-85167896090 (Scopus ID)
Available from: 2023-09-28 Created: 2023-09-28 Last updated: 2025-02-07. Bibliographically approved
Maier, B. D. & Sahlin, K. (2023). Entropy predicts sensitivity of pseudorandom seeds. Genome Research, 33(7), 1162-1174
2023 (English). In: Genome Research, ISSN 1088-9051, E-ISSN 1549-5469, Vol. 33, no 7, p. 1162-1174. Article in journal (Refereed). Published
Abstract [en]

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, their sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness–sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity compared with other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.

National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-225341 (URN); 10.1101/gr.277645.123 (DOI); 37217253 (PubMedID); 2-s2.0-85168804709 (Scopus ID)
Funder
Swedish Research Council, 2018-05973; Swedish Research Council; Swedish Research Council, 2021-04000
Available from: 2024-01-15 Created: 2024-01-15 Last updated: 2024-02-09. Bibliographically approved
Petri, A. J. & Sahlin, K. (2023). isONform: reference-free transcriptome reconstruction from Oxford Nanopore data. Bioinformatics, 39, i222-i231
2023 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 39, p. i222-i231. Article in journal (Refereed). Published
Abstract [en]

Motivation: With advances in long-read transcriptome sequencing, we can now fully sequence transcripts, which greatly improves our ability to study transcription processes. A popular long-read transcriptome sequencing technique is Oxford Nanopore Technologies (ONT), which, through its cost-effective sequencing and high throughput, has the potential to characterize the transcriptome in a cell. However, due to transcript variability and sequencing errors, long cDNA reads need substantial bioinformatic processing to produce a set of isoform predictions from the reads. Several genome and annotation-based methods exist to produce transcript predictions. However, such methods require high-quality genomes and annotations and are limited by the accuracy of long-read splice aligners. In addition, gene families with high heterogeneity may not be well represented by a reference genome and would benefit from reference-free analysis. Reference-free methods to predict transcripts from ONT, such as RATTLE, exist, but their sensitivity is not comparable to reference-based approaches.

Results: We present isONform, a high-sensitivity algorithm to construct isoforms from ONT cDNA sequencing data. The algorithm is based on iterative bubble popping on gene graphs built from fuzzy seeds from the reads. Using simulated, synthetic, and biological ONT cDNA data, we show that isONform has substantially higher sensitivity than RATTLE, albeit with some loss in precision. On biological data, we show that isONform's predictions have substantially higher consistency with the annotation-based method StringTie2 compared with RATTLE. We believe isONform can be used both for isoform construction for organisms without well-annotated genomes and as an orthogonal method to verify predictions of reference-based methods.

National Category
Biological Sciences; Environmental Biotechnology; Computer and Information Sciences; Mathematics
Identifiers
urn:nbn:se:su:diva-220840 (URN); 10.1093/bioinformatics/btad264 (DOI); 001027457000029 (ISI); 37387174 (PubMedID); 2-s2.0-85163651809 (Scopus ID)
Available from: 2023-09-14 Created: 2023-09-14 Last updated: 2025-05-20. Bibliographically approved
Namias, A., Sahlin, K., Makoundou, P., Bonnici, I., Sicard, M., Belkhir, K. & Weill, M. (2023). Nanopore sequencing of PCR products enables multicopy gene family reconstruction. Computational and Structural Biotechnology Journal, 21, 3656-3664
2023 (English). In: Computational and Structural Biotechnology Journal, E-ISSN 2001-0370, Vol. 21, p. 3656-3664. Article in journal (Refereed). Published
Abstract [en]

The importance of gene amplifications in evolution is increasingly recognized. Yet, tools to study multi-copy gene families are still scarce, and many such families are overlooked using common sequencing methods. Haplotype reconstruction is even harder for polymorphic multi-copy gene families. Here, we show that all variants (or haplotypes) of a multi-copy gene family present in a single genome can be obtained using Oxford Nanopore Technologies sequencing of PCR products, followed by steps of mapping, SNP calling and haplotyping. As a proof of concept, we acquired the sequences of highly similar variants of the cidA and cidB genes present in the genome of the Wolbachia wPip, a bacterium infecting Culex pipiens mosquitoes. Our method relies on a wide database of cid genes, previously acquired by cloning and Sanger sequencing. We addressed problems commonly faced when using mapping approaches for multi-copy gene families with highly similar variants. In addition, we confirmed that PCR amplification causes frequent chimeras, which have to be carefully considered when working on families of recombinant genes. We tested the robustness of the method using a combination of bioinformatics (read simulations) and molecular biology approaches (sequence acquisitions through cloning and Sanger sequencing, specific PCRs and digital droplet PCR). When different haplotypes present within a single genome cannot be reconstructed from short-read sequencing, this pipeline provides high-throughput acquisition, gives reliable results, and offers insights into the relative copy numbers of the different variants.

Keywords
Multi-copy genes, Nanopore sequencing, PCR recombination, Wolbachia
National Category
Bioinformatics and Computational Biology
Identifiers
urn:nbn:se:su:diva-225342 (URN); 10.1016/j.csbj.2023.07.012 (DOI); 001046093700001 (ISI); 2-s2.0-85165127766 (Scopus ID)
Funder
Swedish Research Council, 2021–04000
Available from: 2024-01-15 Created: 2024-01-15 Last updated: 2025-02-07. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-7378-2320
