Benchmarking homology detection procedures with low complexity filters
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics.
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics.
2009 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1460-2059, Vol. 25, no 19, 2500-2505 p. Article in journal (Refereed). Published.
Abstract [en]

BACKGROUND: Low-complexity sequence regions present a common problem in finding true homologs to a protein query sequence. Several solutions to this have been suggested, but a detailed comparison between these on challenging data has so far been lacking. A common benchmark for homology detection procedures is to use SCOP/ASTRAL domain sequences belonging to the same or different superfamilies, but these contain almost no low complexity sequences.

RESULTS: We here introduce an alternative benchmarking strategy based around Pfam domains and clans on whole-proteome data sets. This gives a realistic level of low complexity sequences. We used it to evaluate all six built-in BLAST low complexity filter settings as well as a range of settings in the MSPcrunch post-processing filter. The effect on alignment length was also assessed.

CONCLUSION: Score matrix adjustment methods provide a low false positive rate at a relatively small loss in sensitivity compared with no filtering, across the range of test conditions we apply. MSPcrunch achieved even less loss in sensitivity, but at a higher false positive rate. A drawback of the score matrix adjustment methods, however, is that the alignments often become truncated.

AVAILABILITY: Perl scripts for MSPcrunch BLAST filtering and for generating the benchmark dataset are available at

Place, publisher, year, edition, pages
2009. Vol. 25, no 19, 2500-2505 p.
URN: urn:nbn:se:su:diva-33341
DOI: 10.1093/bioinformatics/btp446
ISI: 000270446400007
PubMedID: 19620098
OAI: diva2:283023
Available from: 2009-12-22. Created: 2009-12-22. Last updated: 2011-09-16. Bibliographically approved.
In thesis
1. The relationship between orthology, protein domain architecture and protein function
2011 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

Lacking experimental data, protein function is often predicted from evolutionary and protein structure theory. Under the 'domain grammar' hypothesis the function of a protein follows from the domains it encodes. Under the 'orthology conjecture', orthologs, related through species formation, are expected to be more functionally similar than paralogs, which are homologs in the same or different species descended from a gene duplication event. However, these assumptions have not thus far been systematically evaluated.

To test the 'domain grammar' hypothesis, we built models for predicting function from the domain combinations present in a protein, and demonstrated that multi-domain combinations imply functions that the individual domains do not. We also developed a novel gene-tree based method for reconstructing the evolutionary histories of domain architectures, to search for cases of architectures that have arisen multiple times in parallel, and found this to be more common than previously reported.

To test the 'orthology conjecture', we first benchmarked methods for homology inference under the obfuscating influence of low-complexity regions, in order to improve the InParanoid orthology inference algorithm. InParanoid was then used to test the relative conservation of functionally relevant properties between orthologs and paralogs at various evolutionary distances, including intron positions, domain architectures, and Gene Ontology functional annotations.

We found an increased conservation of domain architectures in orthologs relative to paralogs, in support of the 'orthology conjecture' and the 'domain grammar' hypotheses acting in tandem. However, equivalent analysis of Gene Ontology functional conservation yielded spurious results, which may be an artifact of species-specific annotation biases in functional annotation databases. I discuss possible ways of circumventing this bias so the 'orthology conjecture' can be tested more conclusively.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2011. 112 p.
Keywords
homology, orthology, paralogy, gene duplications, protein function prediction, low-complexity regions, protein domains, domain architecture evolution, introns, intron position conservation, orthology conjecture, domain grammar hypothesis
National Category
Bioinformatics and Systems Biology
Research subject
Biochemistry with Emphasis on Theoretical Chemistry
urn:nbn:se:su:diva-62152 (URN)
978-91-7447-350-6 (ISBN)
Public defence
2011-10-24, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 14:00 (English)
At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 6: Epub ahead of print.
Available from: 2011-10-02. Created: 2011-09-09. Last updated: 2011-10-06. Bibliographically approved.

Authors
Forslund, Kristoffer; Sonnhammer, Erik L.L.