The relationship between orthology, protein domain architecture and protein function
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics. (Stockholm Bioinformatics Centre)
2011 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Lacking experimental data, protein function is often predicted from evolutionary and protein structure theory. Under the 'domain grammar' hypothesis the function of a protein follows from the domains it encodes. Under the 'orthology conjecture', orthologs, related through species formation, are expected to be more functionally similar than paralogs, which are homologs in the same or different species descended from a gene duplication event. However, these assumptions have not thus far been systematically evaluated.

To test the 'domain grammar' hypothesis, we built models for predicting function from the domain combinations present in a protein, and demonstrated that multi-domain combinations imply functions that the individual domains do not. We also developed a novel gene-tree based method for reconstructing the evolutionary histories of domain architectures, to search for cases of architectures that have arisen multiple times in parallel, and found this to be more common than previously reported.

To test the 'orthology conjecture', we first benchmarked methods for homology inference under the obfuscating influence of low-complexity regions, in order to improve the InParanoid orthology inference algorithm. InParanoid was then used to test the relative conservation of functionally relevant properties between orthologs and paralogs at various evolutionary distances, including intron positions, domain architectures, and Gene Ontology functional annotations.

We found an increased conservation of domain architectures in orthologs relative to paralogs, in support of the 'orthology conjecture' and the 'domain grammar' hypotheses acting in tandem. However, equivalent analysis of Gene Ontology functional conservation yielded spurious results, which may be an artifact of species-specific annotation biases in functional annotation databases. I discuss possible ways of circumventing this bias so the 'orthology conjecture' can be tested more conclusively.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2011. 112 p.
Keyword [en]
homology, orthology, paralogy, gene duplications, protein function prediction, low-complexity regions, protein domains, domain architecture evolution, introns, intron position conservation, orthology conjecture, domain grammar hypothesis
National Category
Bioinformatics and Systems Biology
Research subject
Biochemistry with Emphasis on Theoretical Chemistry
Identifiers
URN: urn:nbn:se:su:diva-62152
ISBN: 978-91-7447-350-6 (print)
OAI: oai:DiVA.org:su-62152
DiVA: diva2:440846
Public defence
2011-10-24, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 14:00 (English)
Note
At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 6: Epub ahead of print.
Available from: 2011-10-02 Created: 2011-09-09 Last updated: 2011-10-06
Bibliographically approved
List of papers
1. Domain tree-based analysis of protein architecture evolution
2008 (English). In: Molecular Biology and Evolution, ISSN 0737-4038, E-ISSN 1537-1719, Vol. 25, no 2, p. 254-264. Article in journal (Refereed). Published
Abstract [en]

Understanding the dynamics behind domain architecture evolution is of great importance to unravel the functions of proteins. Complex architectures have been created throughout evolution by rearrangement and duplication events. An interesting question is how many times a particular architecture has been created, a form of convergent evolution or domain architecture reinvention. Previous studies have approached this issue by comparing architectures found in different species. We wanted to achieve a finer-grained analysis by reconstructing protein architectures on complete domain trees. The prevalence of domain architecture reinvention in 96 genomes was investigated with a novel domain tree-based method that uses maximum parsimony for inferring ancestral protein architectures. Domain architectures were taken from Pfam. To ensure robustness, we applied the method to bootstrap trees and only considered results with strong statistical support. We detected multiple origins for 12.4% of the scored architectures. In a much smaller data set, the subset of completely domain-assigned proteins, the figure was 5.6%. These results indicate that domain architecture reinvention is a much more common phenomenon than previously thought. We also determined which domains are most frequent in multiply created architectures and assessed whether specific functions could be attributed to them. However, no strong functional bias was found in architectures with multiple origins.
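As a rough illustration of what maximum parsimony on a tree means here, the sketch below runs the classic Fitch small-parsimony pass over a toy rooted binary tree whose leaves are labelled with domain architectures. It is not the method from the paper: the tree encoding, the function name `fitch`, and the architecture labels "A" and "A+B" are all assumptions made for this example, and the bootstrap support filtering mentioned in the abstract is left out.

```python
# Sketch only: Fitch small parsimony on a rooted binary tree.
# Tree encoding, state labels, and names are illustrative assumptions,
# not the published domain tree-based method.

def fitch(tree, leaf_states):
    """Return (candidate states at this node, minimum number of state changes).

    tree: a leaf name (str) or a 2-tuple (left_subtree, right_subtree).
    leaf_states: dict mapping leaf name -> observed architecture label.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no change needed at this node.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical toy data: architecture "A+B" appears in two separate clades.
toy_tree = (((("a", "b"), "c"), ("d", ("e", "f"))), "g")
toy_states = {
    "a": "A+B", "b": "A+B", "c": "A",
    "d": "A", "e": "A+B", "f": "A+B", "g": "A",
}
root_states, changes = fitch(toy_tree, toy_states)
print(root_states, changes)
```

On this toy input the root resolves to the single state "A" with two changes, which is the pattern one would read as two independent origins of "A+B".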

Keyword
protein, domain, architecture, evolution
National Category
Bioinformatics and Systems Biology
Research subject
Biochemistry with Emphasis on Theoretical Chemistry
Identifiers
URN: urn:nbn:se:su:diva-14975
DOI: 10.1093/molbev/msm254
ISI: 000253634800004
PubMedID: 18025066
Available from: 2008-11-12 Created: 2008-11-12 Last updated: 2011-09-21
Bibliographically approved
2. Predicting protein function from domain content
2008 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1460-2059, Vol. 24, no 15, p. 1681-1687. Article in journal (Refereed). Published
Abstract [en]

MOTIVATION: Computational assignment of protein function may be the single most vital application of bioinformatics in the post-genome era. These assignments are based on various protein features, one of which is the presence of identifiable domains. Investigating the relationship between protein domain content and function is important for understanding how domain combinations encode complex functions.

RESULTS: Two different models are presented on how protein domain combinations yield specific functions: one rule-based and one probabilistic. We demonstrate how these are useful for Gene Ontology annotation transfer. The first is an intuitive generalization of the Pfam2GO mapping, and detects cases of strict functional implications of sets of domains. The second uses a probabilistic model to represent the relationship between domain content and annotation terms, and was found to be better suited for incomplete training sets. We implemented these models as predictors of Gene Ontology functional annotation terms. Both predictors were more accurate than conventional best BLAST-hit annotation transfer and more sensitive than a single-domain model on a large-scale dataset. We present a number of cases where combinations of Pfam-A protein domains predict functional terms that do not follow from the individual domains.

AVAILABILITY: Scripts and documentation are available for download at http://sonnhammer.sbc.su.se/multipfam2go_source_docs.tar
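The rule-based idea described in the RESULTS paragraph, where a set of domains strictly implies an annotation term, can be pictured with the small sketch below. It is an assumption-laden illustration rather than the released scripts: the function name `predict_terms`, the rule table, and every identifier in it (`PF_A`, `GO:term1`, and so on) are hypothetical placeholders, and no real Pfam2GO data is used.

```python
# Sketch only: domain-set rules as a generalisation of one-domain -> term mappings.
# All identifiers (PF_A, PF_B, GO:term1, ...) are hypothetical placeholders.

def predict_terms(domains, rules):
    """Return the terms implied by every rule whose domain set is fully present."""
    present = frozenset(domains)
    predicted = set()
    for required, terms in rules.items():
        if required <= present:  # rule fires only if all its domains occur
            predicted |= terms
    return predicted

toy_rules = {
    frozenset({"PF_A"}): {"GO:term1"},
    frozenset({"PF_B"}): {"GO:term2"},
    # A combination rule: this term follows only from both domains together.
    frozenset({"PF_A", "PF_B"}): {"GO:term3"},
}

single_a = predict_terms(["PF_A"], toy_rules)
combined = predict_terms(["PF_A", "PF_B", "PF_C"], toy_rules)
print(sorted(single_a), sorted(combined))
```

Keying the table on `frozenset` makes a single-domain mapping just the special case of a one-element rule, which is the sense in which a set-based rule table generalises a per-domain one.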

Keyword
Amino Acid Sequence; Computer Simulation; Models, Biological; Models, Chemical; Molecular Sequence Data; Protein Structure, Tertiary; Proteins/*chemistry/classification/*metabolism; Sequence Analysis, Protein/*methods; Structure-Activity Relationship
National Category
Bioinformatics and Systems Biology
Research subject
Biochemistry with Emphasis on Theoretical Chemistry
Identifiers
URN: urn:nbn:se:su:diva-14973
DOI: 10.1093/bioinformatics/btn312
ISI: 000257956600005
PubMedID: 18591194
Available from: 2008-11-12 Created: 2008-11-12 Last updated: 2011-09-21
Bibliographically approved
3. Benchmarking homology detection procedures with low complexity filters
2009 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1460-2059, Vol. 25, no 19, p. 2500-2505. Article in journal (Refereed). Published
Abstract [en]

BACKGROUND: Low-complexity sequence regions present a common problem in finding true homologs to a protein query sequence. Several solutions to this have been suggested, but a detailed comparison between these on challenging data has so far been lacking. A common benchmark for homology detection procedures is to use SCOP/ASTRAL domain sequences belonging to the same or different superfamilies, but these contain almost no low complexity sequences.

RESULTS: We here introduce an alternative benchmarking strategy based around Pfam domains and clans on whole-proteome data sets. This gives a realistic level of low complexity sequences. We used it to evaluate all six built-in BLAST low complexity filter settings as well as a range of settings in the MSPcrunch post-processing filter. The effect on alignment length was also assessed.

CONCLUSION: Score matrix adjustment methods provide a low false positive rate at a relatively small loss in sensitivity compared with no filtering, across the range of test conditions we apply. MSPcrunch achieved even less loss in sensitivity, but at a higher false positive rate. A drawback of the score matrix adjustment methods, however, is that the alignments often become truncated.

AVAILABILITY: Perl scripts for MSPcrunch BLAST filtering and for generating the benchmark dataset are available at http://sonnhammer.sbc.su.se/download/software/MSPcrunch+Blixem/benchmark.tar.gz

Identifiers
URN: urn:nbn:se:su:diva-33341
DOI: 10.1093/bioinformatics/btp446
ISI: 000270446400007
PubMedID: 19620098
Available from: 2009-12-22 Created: 2009-12-22 Last updated: 2011-09-16
Bibliographically approved
4. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis
2010 (English). In: Nucleic Acids Research, ISSN 0305-1048, E-ISSN 1362-4962, Vol. 38, no 1, p. D196-D203. Article in journal (Refereed). Published
Abstract [en]

The InParanoid project gathers proteomes of completely sequenced eukaryotic species plus Escherichia coli and calculates pairwise ortholog relationships among them. The new release 7.0 of the database has grown by an order of magnitude over the previous version and now includes 100 species and their collective 1.3 million proteins organized into 42.7 million pairwise ortholog groups. The InParanoid algorithm itself has been revised and is now both more specific and more sensitive. Based on results from our recent benchmarking of low-complexity filters in homology assignment, a two-pass BLAST approach was developed that makes use of high-precision compositional score matrix adjustment, but avoids the alignment truncation that sometimes follows. We have also updated the InParanoid web site (http://InParanoid.sbc.su.se). Several features have been added, the response times have been improved and the site now sports a new, clearer look. As the number of ortholog databases has grown, it has become difficult to compare these resources because of a lack of standardized source data and incompatible representations of ortholog relationships. To facilitate data exchange and comparisons among ortholog databases, we have developed and are making available two XML schemas: SeqXML for the input sequences and OrthoXML for the output ortholog clusters.

National Category
Bioinformatics and Systems Biology
Research subject
Biochemistry with Emphasis on Theoretical Chemistry
Identifiers
URN: urn:nbn:se:su:diva-34279
DOI: 10.1093/nar/gkp931
ISI: 000276399100030
PubMedID: 19892828
Available from: 2010-01-18 Created: 2010-01-07 Last updated: 2013-09-02
Bibliographically approved
5. Orthology confers intron position conservation
2010 (English). In: BMC Genomics, ISSN 1471-2164, Vol. 11:412. Article in journal (Refereed). Published
Abstract [en]

Background: With the wealth of genomic data available it has become increasingly important to assign putative protein function through functional transfer between orthologs. Therefore, correct elucidation of the evolutionary relationships among genes is a critical task, and attempts should be made to further improve the phylogenetic inference by adding relevant discriminating features. It has been shown that introns can maintain their position over long evolutionary timescales. For this reason, it could be possible to use conservation of intron positions as a discriminating factor when assigning orthology. Therefore, we wanted to investigate whether orthologs have a higher degree of intron position conservation (IPC) compared to non-orthologous sequences that are equally similar in sequence.

Results: To this end, we developed a new score for IPC and applied it to ortholog groups between human and six other species. For comparison, we also gathered the closest non-orthologs, meaning sequences close in sequence space, yet falling just outside the ortholog cluster. We found that ortholog-ortholog gene pairs on average have a significantly higher degree of IPC compared to ortholog-closest non-ortholog pairs. Pairs of inparalogs were also found to have a higher IPC score than inparalog-closest non-inparalog pairs. We verified that these differences cannot simply be attributed to the generally higher sequence identity of the ortholog-ortholog and the inparalog-inparalog pairs. Furthermore, we analyzed the agreement between IPC score and the ortholog score assigned by the InParanoid algorithm, and found that it was consistently high for all species comparisons. In a minority of cases, the IPC and InParanoid scores ranked inparalogs differently. These represent cases where sequence and intron position divergence are discordant. We further analyzed the discordant clusters to identify any possible preference for protein functions by looking for enriched GO terms and Pfam protein domains. They were enriched for functions important for multicellularity, which implies a connection between shifts in intronic structure and the origin of multicellularity.

Conclusions: We conclude that orthologous genes tend to have more conserved intron positions compared to non-orthologous genes. As a consequence, our IPC score is useful as an additional discriminating factor when assigning orthology.

National Category
Bioinformatics and Systems Biology
Research subject
Biochemistry with Emphasis on Theoretical Chemistry
Identifiers
URN: urn:nbn:se:su:diva-49467
DOI: 10.1186/1471-2164-11-412
ISI: 000280399500001
Note
Author count: 3
Available from: 2010-12-15 Created: 2010-12-14 Last updated: 2011-09-21
Bibliographically approved
6. Domain architecture conservation in orthologs
2011 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 12, 326. Article in journal (Refereed). Published
Abstract [en]

Background. As orthologous proteins are expected to retain function more often than other homologs, they are often used for functional annotation transfer between species. However, ortholog identification methods do not take into account changes in domain architecture, which are likely to modify a protein's function. By domain architecture we refer to the sequential arrangement of domains along a protein sequence. To assess the level of domain architecture conservation among orthologs, we carried out a large-scale study of such events between human and 40 other species spanning the entire evolutionary range. We designed a score to measure domain architecture similarity and used it to analyze differences in domain architecture conservation between orthologs and paralogs relative to the conservation of primary sequence. We also statistically characterized the extents of different types of domain swapping events across pairs of orthologs and paralogs.

Results. The analysis shows that orthologs exhibit greater domain architecture conservation than paralogous homologs, even when differences in average sequence divergence are compensated for, for homologs that have diverged beyond a certain threshold. We interpret this as an indication of a stronger selective pressure on orthologs than paralogs to retain the domain architecture required for the proteins to perform a specific function. In general, orthologs as well as the closest paralogous homologs have very similar domain architectures, even at large evolutionary separation. The most common domain architecture changes observed in both ortholog and paralog pairs involved insertion/deletion of new domains, while domain shuffling and segment duplication/deletion were very infrequent.

Conclusions. On the whole, our results support the hypothesis that function conservation between orthologs demands higher domain architecture conservation than other types of homologs, relative to primary sequence conservation. This supports the notion that orthologs are functionally more similar than other types of homologs at the same evolutionary distance.
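The abstract does not spell out how the domain architecture similarity score is computed, so the sketch below uses a stand-in: a normalised edit distance over the ordered list of domains, matching the abstract's definition of architecture as the sequential arrangement of domains. The function name `architecture_similarity` and the domain labels "X", "Y", "Z" are hypothetical, and the published score may differ substantially.

```python
# Sketch only: a stand-in similarity between two domain architectures,
# each given as an ordered list of domain labels. This is NOT the score
# defined in the paper; it is a normalised Levenshtein distance.

def architecture_similarity(arch_a, arch_b):
    """Return 1 - (edit distance / length of the longer architecture)."""
    if not arch_a and not arch_b:
        return 1.0
    # prev[j] holds the edit distance between the processed prefix of arch_a
    # and the first j domains of arch_b.
    prev = list(range(len(arch_b) + 1))
    for i, dom_a in enumerate(arch_a, start=1):
        curr = [i]
        for j, dom_b in enumerate(arch_b, start=1):
            substitution = prev[j - 1] + (0 if dom_a == dom_b else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return 1.0 - prev[-1] / max(len(arch_a), len(arch_b))

# Hypothetical domain labels.
identical = architecture_similarity(["X", "Y", "Z"], ["X", "Y", "Z"])
one_insertion = architecture_similarity(["X", "Z"], ["X", "Y", "Z"])
unrelated = architecture_similarity(["X"], ["Y"])
print(identical, one_insertion, unrelated)
```

An edit-distance formulation counts a single domain insertion or deletion as one step, which lines up with the kind of change the abstract reports as most common, but it treats a shuffle as several steps and so would need refinement to distinguish the event types studied in the paper.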

Keyword
orthologous proteins, domain architecture, homologs
National Category
Bioinformatics and Systems Biology
Research subject
Biochemistry with Emphasis on Theoretical Chemistry
Identifiers
URN: urn:nbn:se:su:diva-60133
DOI: 10.1186/1471-2105-12-326
ISI: 000294948100001
Available from: 2011-08-09 Created: 2011-08-09 Last updated: 2017-12-08
Bibliographically approved

By author/editor
Forslund, Kristoffer
By organisation
Department of Biochemistry and Biophysics
Bioinformatics and Systems Biology
