Reconciling gene family evolution and species evolution
Sjöstrand, Joel
Stockholm University, Faculty of Science, Numerical Analysis and Computer Science (NADA). (Computational Biology)
ORCID iD: 0000-0002-6470-0239
2013 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Species evolution can often be adequately described with a phylogenetic tree. Interestingly, this is the case also for the evolution of homologous genes; a gene in an ancestral species may – through gene duplication, gene loss, lateral gene transfer (LGT), and speciation events – give rise to a gene family distributed across contemporaneous species. However, molecular sequence evolution and genetic recombination make the history – the gene tree – non-trivial to reconstruct from present-day sequences. This history is of biological interest, e.g., for inferring potential functional equivalences of extant gene pairs.

In this thesis, we present biologically sound probabilistic models for gene family evolution guided by species evolution – effectively yielding a gene-species tree reconciliation. Using Bayesian Markov-chain Monte Carlo (MCMC) inference techniques, we show that by taking advantage of the information provided by the species tree, our methods achieve more reliable gene tree estimates than traditional species tree-uninformed approaches.

Specifically, we describe a comprehensive model that accounts for gene duplication, gene loss, a relaxed molecular clock, and sequence evolution, and we show that the method performs admirably on synthetic and biological data. Furthermore, we present two expansions of the inference procedure, enabling it to provide (i) refined gene tree estimates with timed duplications, and (ii) probabilistic orthology estimates – i.e., that the origin of a pair of extant genes is a speciation.

Finally, we present a substantial development of the model to account also for LGT. A sophisticated algorithmic framework of dynamic programming and numerical methods for differential equations is used to resolve the computational hurdles that LGT brings about. We apply the method to two bacterial datasets where LGT is believed to be prominent, in order to estimate genome-wide LGT and duplication rates. We further show that traditional methods – in which gene trees are reconstructed and reconciled with the species tree in separate stages – tend to yield inferior gene tree estimates and thereby overestimate the number of LGT events.

Abstract [sv, English translation]

The evolution of species can in many cases be described by a tree, as already attested by Darwin's notebooks from HMS Beagle. The same holds for homologous genes; a gene in an ancestral species may – through gene duplications, gene losses, lateral gene transfer (LGT), and speciations – give rise to a gene family spread across contemporary species. Reconstructing the emergence of the gene family – the gene tree – from sequences of present-day species is non-trivial because of genetic recombination and sequence evolution. The gene tree is nevertheless of biological interest, in particular because it enables assumptions about functional relatedness between present-day gene pairs.

This thesis deals with biologically well-founded probabilistic models for gene family evolution. These models draw on the strong influence of species evolution on the history of the gene family, and essentially give rise to a reconciliation of gene tree and species tree. Through Bayesian inference based on Markov-chain Monte Carlo (MCMC), we show that our methods produce better gene tree estimates than traditional approaches that do not take the species tree into account.

More specifically, we describe a model that encompasses gene duplications, gene losses, a relaxed molecular clock, and sequence evolution, and show that the method gives high-quality estimates on both synthetic and biological data. Furthermore, we present two extensions of this framework that enable (i) gene tree estimates with time points for duplications, and (ii) probabilistic orthology estimates – i.e., that two present-day genes originate from a speciation.

Finally, we present a model that includes LGT in addition to the mechanisms mentioned above. The computational difficulties that LGT gives rise to are solved with an intricate framework of dynamic programming and numerical methods for differential equations. We apply the method to estimate the LGT and duplication rates of two bacterial datasets in which LGT is presumed to have played a central role. We also show that traditional methods – in which gene trees are estimated and reconciled with the species tree in separate steps – tend to give poorer gene tree estimates, and thereby overestimate the number of LGT events.

Place, publisher, year, edition, pages
Stockholm: Numerical Analysis and Computer Science (NADA), Stockholm University, 2013. 59 p.
Keyword [en]
Computational biology, Bioinformatics, Phylogenetics, Phylogenomics, Comparative genomics, Evolutionary biology
National Category
Computer Science
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:su:diva-93346
ISBN: 978-91-7447-760-3 (print)
OAI: oai:DiVA.org:su-93346
DiVA: diva2:653328
Public defence
2013-11-04, Inghesalen, Widerströmska huset, Karolinska Institutet, Tomtebodavägen 18, Solna, 13:30 (English)
Opponent
Supervisors
Note

At the time of the doctoral defense, the following papers were unpublished and had the following status: Paper 3: Manuscript. Paper 5: Manuscript.

Available from: 2013-10-13 Created: 2013-09-09 Last updated: 2015-11-30 Bibliographically approved
List of papers
1. GenPhyloData: realistic simulation of gene family evolution
2013 (English) In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, 209. Article in journal (Refereed) Published
Abstract [en]

Background: PrIME-GenPhyloData is a suite of tools for creating realistic simulated phylogenetic trees, in particular for families of homologous genes. It supports generation of trees based on a birth-death process and, perhaps more interestingly, also supports generation of gene family trees guided by a known (synthetic or biological) species tree while accounting for events such as gene duplication, gene loss, and lateral gene transfer (LGT). The suite also supports a wide range of branch rate models enabling relaxation of the molecular clock. Result: Simulated data created with PrIME-GenPhyloData can be used for benchmarking phylogenetic approaches, or for characterizing models or model parameters with respect to biological data. Conclusion: The concept of tree-in-tree evolution can also be used to model, for instance, biogeography or host-parasite co-evolution.
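
As an illustration of the birth-death idea mentioned in this abstract, the following is a minimal sketch of how gene copies might duplicate and be lost along a single species-tree branch. It is not the PrIME-GenPhyloData implementation: the function name `simulate_copies`, the constant per-copy rates `dup_rate` and `loss_rate`, and the example values are all assumptions made for this example.

```python
import random

def simulate_copies(n_start, dup_rate, loss_rate, branch_time, rng):
    """Simulate the number of gene copies at the end of one branch.

    Assumption for this sketch: every copy duplicates at rate dup_rate
    and is lost at rate loss_rate (a linear birth-death process).
    Events are drawn one at a time until the branch time is used up.
    """
    n, t = n_start, 0.0
    while n > 0:
        total_rate = n * (dup_rate + loss_rate)
        if total_rate == 0:
            break  # no events can happen
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > branch_time:
            break  # the next event would fall after the end of the branch
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            n += 1  # duplication
        else:
            n -= 1  # loss
    return n

# Hypothetical rates and branch length, chosen only for illustration.
rng = random.Random(1)
counts = [simulate_copies(1, 0.3, 0.2, 1.0, rng) for _ in range(5)]
print(counts)
```

A full simulator in the spirit of the abstract would run such a process on every branch of a species tree and record the resulting gene tree, and would add LGT events and branch rate models on top.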

Keyword
Phylogenetics, Synthetic data, Gene family, Gene duplication, Gene loss, LGT, Molecular clock, Biogeography, Host-parasite co-evolution
National Category
Biochemistry and Molecular Biology Microbiology
Identifiers
urn:nbn:se:su:diva-92509 (URN)
10.1186/1471-2105-14-209 (DOI)
000321381300001 ()
Note

AuthorCount:4;

Available from: 2013-08-09 Created: 2013-08-07 Last updated: 2017-12-06 Bibliographically approved
2. DLRS: Gene tree evolution in light of a species tree
2012 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 28, no 22, p. 2994-2995. Article in journal (Refereed) Published
Abstract [en]

PrIME-DLRS (or colloquially: 'Delirious') is a phylogenetic software tool to simultaneously infer and reconcile a gene tree given a species tree. It accounts for duplication and loss events and a relaxed molecular clock, and is intended for the study of homologous gene families, for example in a comparative genomics setting involving multiple species. PrIME-DLRS uses a Bayesian MCMC framework, where the input is a known species tree with divergence times and a multiple sequence alignment, and the output is a posterior distribution over gene trees and model parameters.

National Category
Bioinformatics (Computational Biology)
Research subject
Numerical Analysis
Identifiers
urn:nbn:se:su:diva-80681 (URN)
10.1093/bioinformatics/bts548 (DOI)
000311303500022 ()
22982573 (PubMedID)
Funder
Swedish Research Council, 2010-4634
Available from: 2012-09-26 Created: 2012-09-26 Last updated: 2017-12-07 Bibliographically approved
3. A Bayesian Method for Analyzing Lateral Gene Transfer
2014 (English) In: Systematic Biology, ISSN 1063-5157, E-ISSN 1076-836X, Vol. 63, no 3, p. 409-420. Article in journal (Refereed) Published
Abstract [en]

Lateral gene transfer (LGT), which transfers DNA between two non-vertically related individuals belonging to the same or different species, is recognized as a major force in prokaryotic evolution, and evidence of its impact on eukaryotic evolution is ever increasing. LGT has attracted much public attention for its potential to transfer pathogenic elements and antibiotic resistance in bacteria, and to transfer pesticide resistance from genetically modified crops to other plants. In a wider perspective, there is a growing body of studies highlighting the role of LGT in enabling organisms to occupy new niches or adapt to environmental changes. The challenge LGT poses to the standard tree-based conception of evolution is also being debated. Studies of LGT have, however, been severely limited by a lack of computational tools. The best currently available LGT algorithms are parsimony-based phylogenetic methods, which require a pre-computed gene tree and cannot choose between sometimes wildly differing most parsimonious solutions. Moreover, in many studies, simple heuristics are applied that can only handle putative orthologs and completely disregard gene duplications (GDs). Consequently, proposed LGT among specific gene families, and the rate of LGT in general, remain debated. We present a Bayesian Markov-chain Monte Carlo-based method that integrates GD, gene loss, LGT, and sequence evolution, and apply the method in a genome-wide analysis of two groups of bacteria: Mollicutes and Cyanobacteria. Our analyses show that although the LGT rate between distant species is high, the net combined rate of duplication and close-species LGT is on average higher. We also show that the common practice of disregarding reconcilability in gene tree inference overestimates the number of LGT and duplication events.

Keyword
Bayesian, gene duplication, gene loss, horizontal gene transfer, lateral gene transfer, MCMC, phylogenetics
National Category
Developmental Biology Computer Science
Research subject
Computer Science
Identifiers
urn:nbn:se:su:diva-104137 (URN)
10.1093/sysbio/syu007 (DOI)
000334752600010 ()
Note

AuthorCount:6;

Available from: 2014-06-04 Created: 2014-06-03 Last updated: 2017-12-05 Bibliographically approved
4. Genome-wide probabilistic reconciliation analysis across vertebrates
2013 (English) In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, no Suppl 15, S10. Article in journal (Refereed) Published
Abstract [en]

Gene duplication is considered to be a major driving force in evolution that enables the genome of a species to acquire new functions. A reconciliation - a mapping of gene tree vertices to the edges or vertices of a species tree - explains where gene duplications have occurred on the species tree. In this study, we sample reconciliations from a posterior over reconciliations, gene trees, edge lengths and other parameters, given a species tree and gene sequences. We employ a Bayesian analysis tool, based on the probabilistic model DLRS that integrates gene duplication, gene loss and sequence evolution under a relaxed molecular clock for substitution rates, to obtain this posterior.

By applying these methods, we perform a genome-wide analysis of a nine-species dataset, OPTIC, and conclude that for many gene families, the most parsimonious reconciliation (MPR) - a reconciliation that minimizes the number of duplications - is far from the correct explanation of the evolutionary history. For the given dataset, we observe that approximately 19% of the sampled reconciliations are different from MPR. This is in clear contrast with previous estimates, based on simpler models and less realistic assumptions, according to which 98% of the reconciliations can be expected to be identical to MPR. We also generate heatmaps showing where in the species trees duplications have been most frequent during the evolution of these species.
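
To make the MPR baseline in this abstract concrete, here is a minimal sketch of the standard LCA mapping, which maps each gene-tree vertex to the lowest common ancestor of the species-tree nodes its children map to, and counts a duplication whenever a vertex maps to the same node as one of its children. This is the parsimony baseline, not the probabilistic DLRS-based sampling the paper uses. The tree encodings, function names, and the toy data are assumptions made for this example.

```python
def ancestors(node, parent):
    """Return the path from node up to the root, node first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca(a, b, parent):
    """Lowest common ancestor of species-tree nodes a and b."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def count_duplications(gene_tree, gene_to_species, parent):
    """Map a binary gene tree onto the species tree via LCA mapping.

    Returns (species node the root maps to, number of duplications).
    A gene-tree vertex is counted as a duplication when it maps to the
    same species-tree node as at least one of its children.
    """
    if isinstance(gene_tree, str):  # leaf: an extant gene
        return gene_to_species[gene_tree], 0
    left, right = gene_tree
    m_left, d_left = count_duplications(left, gene_to_species, parent)
    m_right, d_right = count_duplications(right, gene_to_species, parent)
    m_here = lca(m_left, m_right, parent)
    is_dup = m_here in (m_left, m_right)
    return m_here, d_left + d_right + int(is_dup)

# Toy species tree ((A,B)AB,C)ABC, encoded as child -> parent (hypothetical data).
parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC"}
gene_to_species = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
gene_tree = ((("a1", "b1"), "a2"), "c1")
root_species, n_dups = count_duplications(gene_tree, gene_to_species, parent)
print(root_species, n_dups)  # ABC 1
```

In the toy example, the vertex joining `(a1, b1)` with `a2` maps to `AB`, the same node as its left child, so it is the single inferred duplication.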

Place, publisher, year, edition, pages
BioMed Central, 2013
National Category
Biochemistry and Molecular Biology Microbiology Mathematical Analysis
Identifiers
urn:nbn:se:su:diva-94358 (URN)
10.1186/1471-2105-14-S15-S10 (DOI)
000328316700010 ()
Conference
11th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics, Lyon, France, Oct 17-19, 2013
Note

AuthorCount: 4;

Available from: 2013-10-03 Created: 2013-10-03 Last updated: 2017-12-06 Bibliographically approved
5. Integrating Sequence Evolution into Probabilistic Orthology Analysis
2015 (English) In: Systematic Biology, ISSN 1063-5157, E-ISSN 1076-836X, Vol. 64, no 6, p. 969-982. Article in journal (Refereed) Published
Abstract [en]

Orthology analysis, that is, finding out whether a pair of homologous genes are orthologs (stemming from a speciation) or paralogs (stemming from a gene duplication), is of central importance in computational biology, genome annotation, and phylogenetic inference. In particular, an orthologous relationship makes functional equivalence of the two genes highly likely. A major approach to orthology analysis is to reconcile a gene tree to the corresponding species tree (most commonly performed using the most parsimonious reconciliation, MPR). However, most such phylogenetic orthology methods infer the gene tree without considering the constraints implied by the species tree and, perhaps even more importantly, only allow the gene sequences to influence the orthology analysis through the a priori reconstructed gene tree. We propose a sound, comprehensive Bayesian Markov chain Monte Carlo-based method, DLRSOrthology, to compute orthology probabilities. It efficiently sums over the possible gene trees and jointly takes into account the current gene tree, all possible reconciliations to the species tree, and the typically strong signal conveyed by the sequences. We compare our method with PrIME-GEM, a probabilistic orthology approach built on a probabilistic duplication-loss model, and MRBAYESMPR, a probabilistic orthology approach that is based on conventional Bayesian inference coupled with MPR. We find that DLRSOrthology outperforms these competing approaches on synthetic data as well as on biological data sets and is robust to incomplete taxon sampling artifacts.
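
The notion of an orthology probability can be illustrated with a much simpler estimator than the one the abstract describes: given a set of sampled reconciled gene trees whose internal vertices are labelled as speciations or duplications, count the fraction of samples in which the lowest common ancestor of the two genes is a speciation. This sketch only tallies one labelled tree per sample and does not sum over reconciliations as DLRSOrthology is said to do. The tuple encoding, the labels `"S"` and `"D"`, and the sample trees are assumptions made for this example.

```python
def leaves(tree):
    """Return the set of leaf names under a tree."""
    if isinstance(tree, str):
        return {tree}
    _, left, right = tree
    return leaves(left) | leaves(right)

def lca_event(tree, g1, g2):
    """Return the event label at the LCA of two distinct leaves g1 and g2.

    Internal vertices are (label, left, right) with label 'S' for
    speciation or 'D' for duplication; leaves are plain strings.
    """
    label, left, right = tree
    for child in (left, right):
        if {g1, g2} <= leaves(child):
            return lca_event(child, g1, g2)  # both genes lie below this child
    return label  # the genes split here, so this vertex is their LCA

def orthology_probability(samples, g1, g2):
    """Fraction of sampled trees in which the LCA of g1 and g2 is a speciation."""
    hits = sum(lca_event(t, g1, g2) == "S" for t in samples)
    return hits / len(samples)

# Four hypothetical posterior samples over three genes.
samples = [
    ("S", ("S", "a1", "b1"), "c1"),
    ("S", ("D", "a1", "b1"), "c1"),
    ("D", ("S", "a1", "c1"), "b1"),
    ("S", ("S", "a1", "b1"), "c1"),
]
print(orthology_probability(samples, "a1", "b1"))  # 0.5
```

Because the estimate is a plain sample frequency, its precision depends on how many approximately independent samples the chain provides.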

Keyword
Comparative genomics, gene duplication, gene loss, orthology, paralogy, phylogenetics, probabilistic modeling, relaxed molecular clock, sequence evolution, tree realization, tree reconciliation
National Category
Biological Sciences Computer Science
Research subject
Computer Science
Identifiers
urn:nbn:se:su:diva-123513 (URN)
10.1093/sysbio/syv044 (DOI)
000363168100007 ()
Available from: 2015-11-30 Created: 2015-11-27 Last updated: 2017-12-01 Bibliographically approved

Open Access in DiVA

fulltext (11335 kB), 426 downloads
File information
File name: FULLTEXT01.pdf
File size: 11335 kB
Checksum (SHA-512): 2587dab72d8795bed2117afec0d5e84b2b2044b92237c5bbbcf9e5af636c01528645da6e753a1e7868205f1829e6a5c9e3e09f892eac7d75e22095fece9af2ce
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Sjöstrand, Joel
By organisation
Numerical Analysis and Computer Science (NADA)
Computer Science

Total: 426 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 901 hits