Computational methods for long-read sequencing data analysis
Petri, Alexander J.
Stockholm University, Faculty of Science, Department of Mathematics. ORCID iD: 0009-0005-9397-0341
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis presents algorithms developed for long-read sequencing techniques, which, since their introduction in the 2010s, have become increasingly important in modern bioscientific research. The first two papers cover the development of algorithms for our de novo transcriptome prediction pipeline, the isON pipeline, while the third paper describes an algorithm for the biotechnological analysis of ligated fragments.

Paper I introduces isONform, an algorithm that predicts the different gene products, called isoforms, from a set of long reads sequenced from complementary DNA, without relying on a reference genome or an annotation. isONform is part of a larger long-read transcriptome pipeline, the isON pipeline, in which clustering and error-correction steps precede the isoform prediction. The algorithm constructs a directed acyclic graph with minimizer pairs as nodes and with edges connecting minimizer pairs that are neighbors on the reads. It then applies an iterative bubble-popping scheme to merge nodes and finally follows all distinct paths through the graph to generate the isoform predictions. The algorithm has been shown to outperform existing state-of-the-art algorithms while giving results comparable to approaches that require a reference genome and an annotation.

Paper II introduces isONclust3, an algorithm for clustering transcriptomic reads by gene family. It constitutes the first step in pipelines for reference-free isoform prediction. The algorithm is based on the minimizer indexing scheme; its novelties are a dynamic clustering approach, the assessment and storage of minimizers by confidence, and an iterative post-clustering merging step. The algorithm has been shown to scale better on large datasets, in terms of runtime and memory usage, than existing methods while yielding comparable or better clustering quality. We demonstrate that isONclust3 is the only algorithm that can cluster PacBio's new Revio datasets, with tens of millions of reads, using typical cluster computing resources (256 Gb RAM). These algorithms help to improve the accuracy and efficiency of transcriptomic analysis based on long-read techniques, which is crucial for understanding complex biological systems and diseases.

Paper III presents cONcat, an algorithmic solution for detecting concatenated fragments in long-read sequencing reads with typical error profiles. The algorithm is based on a greedy heuristic that uses the edit distance to find the best-fitting fragments and then divides the sequence around those hits to search for further fragment hits in the remaining areas of the read. The algorithm has been shown to be resilient to errors in the data and to scale to large numbers of reads.

Place, publisher, year, edition, pages
Stockholm: Department of Mathematics, Stockholm University, 2025, p. 46
National Category
Bioinformatics (Computational Biology)
Research subject
Computational Mathematics
Identifiers
URN: urn:nbn:se:su:diva-243271
ISBN: 978-91-8107-294-5 (print)
ISBN: 978-91-8107-295-2 (electronic)
OAI: oai:DiVA.org:su-243271
DiVA, id: diva2:1959538
Public defence
2025-08-27, Lärosal 10, vån 2, hus 2, Albano, Albanovägen 18, Stockholm, 13:00 (English)
Opponent
Supervisors
Available from: 2025-06-03 Created: 2025-05-20 Last updated: 2025-05-23 Bibliographically approved
List of papers
1. cONcat: Computational reconstruction of concatenated fragments from long Oxford Nanopore reads
2025 (English) Article in journal (Other academic) Submitted
Abstract [en]

Synthetic combinatorial DNA libraries are widely used to produce protein variants, to optimize binders, and for high-throughput studies of protein-DNA interactions. The libraries can be made by researchers or vendors, and high-throughput sequencing is used both for quality control and to study the outcome of selection experiments. Oxford Nanopore sequencing (ONT) is well suited to this, as it allows for long read lengths and can be done rapidly with low-cost instrumentation. However, it suffers from a lower overall read accuracy and an uneven error profile. No current bioinformatics tools are well suited to the challenge of deducing the composition and order of the constituent members of combinatorial libraries from ONT reads.

We introduce cONcat, an algorithm that identifies the makeup of concatenated DNA fragments in a set of ONT sequencing reads, given a pool of known fragments. cONcat uses an edit-distance-based recursive covering algorithm to find the best possible matchings between the fragments and the reads. In our experiments on simulated and experimental data, cONcat could accurately detect the correct fragment coverings despite the short fragment sizes (< 20 bp) and the sequencing errors present in ONT reads. However, we find that the high error rates at the start of ONT reads make it challenging to obtain confident coverage there, which implies a need for experimental strategies that avoid placing key sequence information at the start of reads.
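
The greedy covering idea can be illustrated with a minimal Python sketch. It is not the cONcat implementation: the function names `best_hit` and `cover`, the relative-distance threshold `max_rel_dist`, and the toy fragments are assumptions made for illustration only. `best_hit` finds the substring of a read with the smallest edit distance to a fragment, and `cover` takes the best-fitting fragment first and then recurses on the parts of the read to its left and right.

```python
def best_hit(fragment, read):
    """Return (distance, start, end) for the substring read[start:end]
    with the smallest edit distance to fragment (leftmost end on ties)."""
    n = len(read)
    # Row 0: an empty fragment prefix matches anywhere in the read at no cost.
    dist = [0] * (n + 1)
    start = list(range(n + 1))
    for i in range(1, len(fragment) + 1):
        new_dist, new_start = [i] + [0] * n, [0] * (n + 1)
        for j in range(1, n + 1):
            sub = dist[j - 1] + (fragment[i - 1] != read[j - 1])
            skip_frag = dist[j] + 1          # fragment base missing in read
            skip_read = new_dist[j - 1] + 1  # extra base in read
            best = min(sub, skip_frag, skip_read)
            new_dist[j] = best
            # Carry along where the best-matching substring begins.
            if best == sub:
                new_start[j] = start[j - 1]
            elif best == skip_frag:
                new_start[j] = start[j]
            else:
                new_start[j] = new_start[j - 1]
        dist, start = new_dist, new_start
    end = min(range(n + 1), key=dist.__getitem__)
    return dist[end], start[end], end


def cover(read, fragments, max_rel_dist=0.2, offset=0):
    """Greedily cover read with known fragments. Returns (name, start, end)
    tuples in coordinates of the original read, ordered left to right."""
    hits = []
    for name, seq in fragments.items():
        d, s, e = best_hit(seq, read)
        if e > s and d <= max_rel_dist * len(seq):
            hits.append((d / len(seq), name, s, e))
    if not hits:
        return []
    _, name, s, e = min(hits)  # best-fitting fragment first
    left = cover(read[:s], fragments, max_rel_dist, offset)
    right = cover(read[e:], fragments, max_rel_dist, offset + e)
    return left + [(name, offset + s, offset + e)] + right


# Toy fragments, chosen to be trivially distinguishable (hypothetical data).
fragments = {"A": "AAAAAAAAAA", "B": "CCCCCCCCCC", "C": "GGGGGGGGGG"}
read = "AAAAAAAAAA" + "CCCCTCCCCC" + "GGGGGGGGGG"  # middle copy has one error
covering = cover(read, fragments)
```

In this toy run the two error-free fragments are placed first, and the copy of `B` with one substitution is recovered from the region left between them. The quadratic dynamic program is only meant to keep the sketch self-contained.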

National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-243270 (URN)
10.1101/2025.03.05.641699 (DOI)
Available from: 2025-05-20 Created: 2025-05-20 Last updated: 2025-06-05
2. De novo clustering of large long-read transcriptome datasets with isONclust3
2025 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 41, no 5, article id btaf207. Article in journal (Refereed) Published
Abstract [en]

Motivation

Long-read sequencing techniques can sequence transcripts from end to end, greatly improving our ability to study the transcription process. Although there are several well-established tools for long-read transcriptome analysis, most are reference-based. This limits the analysis of organisms without high-quality reference genomes and of samples or genes with high variability (e.g. cancer samples or some gene families). In such settings, analysis using a reference-free method is favorable. The computational problem of clustering long reads by region of common origin is well established for reference-free transcriptome analysis pipelines. Such clustering enables large datasets to be split roughly by gene family and therefore allows an independent analysis of each cluster. Tools for this exist; however, none of them can efficiently process the large number of reads now generated by long-read sequencing technologies.

Results

We present isONclust3, an improved algorithm over isONclust and isONclust2, to cluster massive long-read transcriptome datasets into gene families. Like isONclust, isONclust3 represents each cluster with a set of minimizers. However, unlike other approaches, isONclust3 dynamically updates the cluster representation during clustering by adding high-confidence minimizers from new reads assigned to the cluster and employs an iterative cluster-merging step. We show that isONclust3 yields results with higher or comparable quality to state-of-the-art algorithms but is 10–100 times faster on large datasets. Also, using a 256 Gb computing node, isONclust3 was the only tool that could cluster 37 million PacBio reads, which is a typical throughput of the recent PacBio Revio sequencing machine.
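
As a rough illustration of minimizer-based clustering with a dynamically updated cluster representation, the following Python sketch may help. It is a simplification under stated assumptions and not the isONclust3 implementation: the function names, the parameters `k`, `w`, and `min_shared`, and the toy sequences are hypothetical, and the confidence assessment of minimizers is omitted, so every minimizer of an assigned read is added to its cluster.

```python
def minimizers(seq, k=5, w=4):
    """Set of lexicographically smallest k-mers, one per window of w
    consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def cluster_reads(reads, k=5, w=4, min_shared=0.5):
    """Assign each read to the cluster sharing the largest fraction of its
    minimizers, or open a new cluster if that fraction is below min_shared."""
    clusters = []  # each cluster: {"minimizers": set, "reads": [read indices]}
    for idx, read in enumerate(reads):
        mins = minimizers(read, k, w)
        best, best_frac = None, 0.0
        for c in clusters:
            frac = len(mins & c["minimizers"]) / len(mins) if mins else 0.0
            if frac > best_frac:
                best, best_frac = c, frac
        if best is not None and best_frac >= min_shared:
            best["reads"].append(idx)
            best["minimizers"] |= mins  # dynamic update of the representation
        else:
            clusters.append({"minimizers": set(mins), "reads": [idx]})
    return clusters


def merge_clusters(clusters, min_shared=0.5):
    """Repeatedly merge cluster pairs whose minimizer sets overlap enough,
    until no further merge is possible."""
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i]["minimizers"], clusters[j]["minimizers"]
                smaller = min(len(a), len(b))
                if smaller and len(a & b) / smaller >= min_shared:
                    a |= b
                    clusters[i]["reads"] += clusters[j]["reads"]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


# Two toy "genes" over disjoint alphabets, so they share no k-mers.
gene1 = "ACCACAACCCACACCAAACC"
gene2 = "GTTGTGGTTTGTGTTGGGTT"
reads = [gene1, gene2, gene1[:15], gene2[3:]]
clusters = merge_clusters(cluster_reads(reads))
```

Here the truncated reads fall into the clusters opened by their full-length counterparts, because every window of a substring is also a window of the original sequence. Real data would need larger `k` and `w` and a tolerance for sequencing errors.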

National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-243269 (URN)
10.1093/bioinformatics/btaf207 (DOI)
001483472300001 (ISI)
40265453 (PubMedID)
2-s2.0-105004673060 (Scopus ID)
Funder
Swedish Research Council, 2021–04000
Available from: 2025-05-20 Created: 2025-05-20 Last updated: 2025-06-02 Bibliographically approved
3. isONform: reference-free transcriptome reconstruction from Oxford Nanopore data
2023 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 39, p. i222-i231. Article in journal (Refereed) Published
Abstract [en]

Motivation

With advances in long-read transcriptome sequencing, we can now fully sequence transcripts, which greatly improves our ability to study transcription processes. A popular long-read transcriptome sequencing technique is Oxford Nanopore Technologies (ONT), which, through its cost-effective sequencing and high throughput, has the potential to characterize the transcriptome in a cell. However, due to transcript variability and sequencing errors, long cDNA reads need substantial bioinformatic processing to produce a set of isoform predictions from the reads. Several genome- and annotation-based methods exist to produce transcript predictions. However, such methods require high-quality genomes and annotations and are limited by the accuracy of long-read splice aligners. In addition, gene families with high heterogeneity may not be well represented by a reference genome and would benefit from reference-free analysis. Reference-free methods to predict transcripts from ONT, such as RATTLE, exist, but their sensitivity is not comparable to reference-based approaches.

Results

We present isONform, a high-sensitivity algorithm to construct isoforms from ONT cDNA sequencing data. The algorithm is based on iterative bubble popping on gene graphs built from fuzzy seeds from the reads. Using simulated, synthetic, and biological ONT cDNA data, we show that isONform has substantially higher sensitivity than RATTLE, albeit with some loss in precision. On biological data, we show that isONform's predictions have substantially higher consistency with the annotation-based method StringTie2 compared with RATTLE. We believe isONform can be used both for isoform construction for organisms without well-annotated genomes and as an orthogonal method to verify predictions of reference-based methods.
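
To make the graph idea more concrete, here is a small Python sketch of a seed graph and of path enumeration through it. It is a sketch under assumptions, not the isONform algorithm: reads are assumed to be already converted into ordered lists of seed labels (hypothetical names such as `s1`), the graph is assumed to be acyclic, and a plain edge-support filter (`min_support`) stands in for the iterative bubble-popping step, which is considerably more involved.

```python
from collections import defaultdict


def build_seed_graph(reads_as_seeds):
    """Count how many reads support each edge between neighboring seeds."""
    edges = defaultdict(int)
    for seeds in reads_as_seeds:
        for a, b in zip(seeds, seeds[1:]):
            edges[(a, b)] += 1
    return dict(edges)


def distinct_paths(edges, min_support=1):
    """Enumerate all source-to-sink paths over edges with enough support.
    Assumes the graph is acyclic; cycles would make the recursion endless."""
    succ = defaultdict(list)
    has_pred = set()
    for (a, b), support in sorted(edges.items()):
        if support >= min_support:
            succ[a].append(b)
            has_pred.add(b)
    sources = [n for n in sorted(succ) if n not in has_pred]
    paths = []

    def walk(node, path):
        if not succ.get(node):  # sink: no outgoing edges kept
            paths.append(path)
            return
        for nxt in succ[node]:
            walk(nxt, path + [nxt])

    for s in sources:
        walk(s, [s])
    return paths


# Three reads of a full-length isoform, two reads skipping seed s2,
# and one noisy read that jumps from s2 to s4 (hypothetical data).
reads_as_seeds = (
    [["s1", "s2", "s3", "s4"]] * 3
    + [["s1", "s3", "s4"]] * 2
    + [["s1", "s2", "s4"]]
)
graph = build_seed_graph(reads_as_seeds)
isoform_paths = distinct_paths(graph, min_support=2)
```

With `min_support=2` the edge seen in only one read is ignored and two paths remain, one per well-supported isoform. The number of paths can grow exponentially with the number of branching points, which is one reason a merging step before enumeration is useful.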

National Category
Biological Sciences; Environmental Biotechnology; Computer and Information Sciences; Mathematics
Identifiers
urn:nbn:se:su:diva-220840 (URN)
10.1093/bioinformatics/btad264 (DOI)
001027457000029 (ISI)
37387174 (PubMedID)
2-s2.0-85163651809 (Scopus ID)
Available from: 2023-09-14 Created: 2023-09-14 Last updated: 2025-05-20 Bibliographically approved

Open Access in DiVA

Computational methods for long-read sequencing data analysis (4186 kB), 46 downloads
File information
File name: FULLTEXT01.pdf
File size: 4186 kB
Checksum (SHA-512): 1db3e7259c4d2e70d6e04e696330ba6342c3762493829300c1488a90d74bd8f5b36a865f22294b73a773a0bf80de405d4780565b64a39a5dbdaea1717d8eea62
Type: fulltext
Mimetype: application/pdf

Authority records

Petri, Alexander J.
