Biological data exchange and the discovery of new protein families in metagenomic samples
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics. (Sonnhammer)
2012 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The rise in sequence data has brought both challenges to the way we exchange biological information and opportunities to discover new protein families, primarily through the investigation of uncultured metagenomic samples.

The Distributed Annotation System, or DAS, provided a means for exchanging protein sequence data, but there were no open source, stand-alone DAS clients optimized for integrating and viewing these data. To address this need, we developed DASher. Complementary to visualizing DAS data with DASher, we also created and made available ten servers to offer real-time protein feature predictions via DAS. While DAS works well for genomic data, there was no such framework for exchanging orthology data in a consistent way. Consequently, we developed the first standards for orthology data exchange, SeqXML and OrthoXML. 64 reference proteomes are now available in SeqXML, and 14 orthology providers have agreed to offer their predictions in OrthoXML. Besides creating a uniform representation of common data types, these standards enable direct comparison and assessment of competing methods for the first time.

A substantial percentage of newly sequenced genes are ORFans, which have no match to previously known sequences. Metagenomics samples uncover sequences from uncultivable and therefore previously unseen species, and ORFans constitute much of the metagenomics data that are completely uncharacterized. ORFans are by definition impervious to standard similarity-based methods, and the few existing metagenomics gene-finding methods performed poorly on short, error-prone next-generation sequence data. Therefore, we designed a new approach to predict protein-coding gene families from metagenomic data and applied it to 17 virally-enriched metagenomes derived from human patients. Of the 456 putative ORFan families we found in the nearly 1 billion nucleotides sequenced from these libraries, we identified 32 putative novel protein families with strong support.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2012. 124 p.
National Category
Bioinformatics and Systems Biology
Research subject
Biochemistry with Emphasis on Theoretical Chemistry
Identifiers
URN: urn:nbn:se:su:diva-75108
ISBN: 978-91-74474-52-7 (print)
OAI: oai:DiVA.org:su-75108
DiVA: diva2:514625
Public defence
2012-05-11, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 13:30 (English)
Note

At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 4: Manuscript.

Available from: 2012-04-19 Created: 2012-04-05 Last updated: 2012-04-11 Bibliographically approved
List of papers
1. DASher: a stand-alone protein sequence client for DAS, the Distributed Annotation System
2009 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 25, no 10, 1333-1334 p. Article in journal (Refereed) Published
Abstract [en]

The rise in biological sequence data has led to a proliferation of separate, specialized databases. While there is great value in having many independent annotations, it is critical that there be a way to integrate them in one combined view. The Distributed Annotation System (DAS) was developed for that very purpose. There are currently no DAS clients that are open source, specialized for aggregating and comparing protein sequence annotation, and that can run as a self-contained application outside of a web browser. The speed, flexibility and extensibility that come with a stand-alone application motivated us to create DASher, an open-source Java DAS client. Given a UniProt sequence identifier, DASher automatically queries DAS-supporting servers worldwide for any information on that sequence and then displays the annotations in an interactive viewer for easy comparison. DASher is a fast, Java-based DAS client optimized for viewing protein sequence annotation and compliant with the latest DAS protocol specification 1.53E. AVAILABILITY: DASher is available for direct use and download at http://dasher.sbc.su.se including examples and source code under the GPLv3 licence. Java version 6 or higher is required.
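
As a rough illustration of the kind of request a DAS client issues for a given sequence identifier, the sketch below builds a DAS 1.5-style `features` URL for each of several servers. The server addresses, source names, and accession are placeholders invented for this example, and the code is not taken from DASher.

```python
from urllib.parse import urlencode


def das_features_url(server: str, source: str, accession: str) -> str:
    """Build a DAS 'features' request URL for one sequence identifier."""
    base = server.rstrip("/")  # tolerate a trailing slash on the server URL
    query = urlencode({"segment": accession})
    return f"{base}/das/{source}/features?{query}"


# Hypothetical servers and source names, used only for illustration.
servers = [
    ("http://das.example.org", "uniprot"),
    ("http://annot.example.net/", "signalp"),
]

# One request URL per server for the same (example) UniProt accession.
urls = [das_features_url(server, source, "P05067") for server, source in servers]
```

A client would fetch each URL, parse the returned XML, and merge the annotations into one view; that part is omitted here.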

Identifiers
urn:nbn:se:su:diva-34289 (URN)
10.1093/bioinformatics/btp153 (DOI)
000265950600020 ()
19297349 (PubMedID)
Available from: 2010-01-18 Created: 2010-01-07 Last updated: 2017-12-12 Bibliographically approved
2. MetaTM - a consensus method for transmembrane protein topology prediction
2009 (English) In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 10, 314- p. Article in journal (Refereed) Published
Abstract [en]

Transmembrane (TM) proteins are proteins that span a biological membrane one or more times. As their 3-D structures are hard to determine, experiments focus on identifying their topology (i.e., which parts of the amino acid sequence are buried in the membrane and which are located on either side of the membrane), but only a few topologies are known. Consequently, various computational TM topology predictors have been developed, but their accuracies are far from perfect. The prediction quality can be improved by applying a consensus approach, which combines results of several predictors to yield a more reliable result. RESULTS: A novel TM consensus method, named MetaTM, is proposed in this work. MetaTM is based on support vector machine models and combines the results of six TM topology predictors and two signal peptide predictors. On a large data set comprising 1460 sequences of TM proteins with known topologies and 2362 globular protein sequences, it correctly predicts 86.7% of all topologies. CONCLUSION: Combining several TM predictors in a consensus prediction framework improves overall accuracy compared to any of the individual methods. Our proposed SVM-based system also has higher accuracy than a previous consensus predictor. MetaTM is made available both as downloadable source code and as a DAS server at http://MetaTM.sbc.su.se.
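
To make the general idea of a consensus prediction concrete, here is a minimal sketch that combines per-residue topology strings by simple majority vote. This is a deliberately simpler technique than the SVM models the abstract describes, and the label alphabet (`i` inside, `M` membrane, `o` outside) and the example predictions are assumptions made for illustration.

```python
from collections import Counter


def majority_consensus(predictions: list[str]) -> str:
    """Combine equal-length topology strings by per-residue majority vote."""
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")

    consensus = []
    for column in zip(*predictions):
        # Pick the most frequent label at this position; on a tie, the
        # label that appears first among the predictors is kept.
        label, _ = Counter(column).most_common(1)[0]
        consensus.append(label)
    return "".join(consensus)


# Three made-up predictor outputs for a seven-residue toy sequence.
example = majority_consensus(["iiMMMoo", "iiMMMMo", "iMMMMoo"])
```

A learned combiner such as an SVM can weight predictors unequally and use context around each residue, which plain voting cannot.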

Identifiers
urn:nbn:se:su:diva-34287 (URN)
10.1186/1471-2105-10-314 (DOI)
000271119400001 ()
19785723 (PubMedID)
Available from: 2010-01-18 Created: 2010-01-07 Last updated: 2017-12-12 Bibliographically approved
3. Letter to the Editor: SeqXML and OrthoXML: standards for sequence and orthology information
2011 (English) In: Briefings in Bioinformatics, ISSN 1467-5463, E-ISSN 1477-4054, Vol. 12, no 5, 485-488 p. Article in journal (Refereed) Published
Abstract [en]

There is a great need for standards in the orthology field. Users must contend with different ortholog data representations from each provider, and the providers themselves must independently gather and parse the input sequence data. These burdensome and redundant procedures make data comparison and integration difficult. We have designed two XML-based formats, SeqXML and OrthoXML, to solve these problems. SeqXML is a lightweight format for sequence records, the input for orthology prediction. It stores the same sequence and metadata as typical FASTA format records, but overcomes common problems such as unstructured metadata in the header and erroneous sequence content. XML provides validation to prevent data integrity problems that are frequent in FASTA files. The range of applications for SeqXML is broad and not limited to ortholog prediction. We provide read/write functions for BioJava, BioPerl, and Biopython. OrthoXML was designed to represent ortholog assignments from any source in a consistent and structured way, yet cater to specific needs such as scoring schemes or meta-information. A unified format is particularly valuable for ortholog consumers that want to integrate data from numerous resources, e.g., for gene annotation projects. Reference proteomes for 61 organisms are already available in SeqXML, and 10 orthology databases have signed on to OrthoXML. Adoption by the entire field would substantially facilitate exchange and quality control of sequence and orthology information.

Keyword
OrthoXML, SeqXML, 'sequence format', 'orthology format', FASTA format, XML
National Category
Natural Sciences
Identifiers
urn:nbn:se:su:diva-67273 (URN)
10.1093/bib/bbr025 (DOI)
000295171700013 ()
Note
authorCount: 4
Available from: 2011-12-30 Created: 2011-12-27 Last updated: 2017-12-08 Bibliographically approved
4. Discovery of novel protein families in metagenomic samples
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Despite the steady rise in gene sequence information, there is a persistent, significant fraction of genes which do not match any previously known sequence. These genes are called ORFans, and metagenomic samples, where DNA is extracted from a mixed population of unknown and often uncultivable species, are a rich source of ORFans. Viral infections cause significant morbidity and mortality, and identifying ORFan viral gene families from human metagenomic samples represents a route to understanding molecular processes that affect human health. Few methods exist for metagenomic gene-finding, and most of them rely on sequence similarity, which cannot be used to detect ORFans. Furthermore, non-similarity-based methods are hard to apply to the complex mixture of short, high-error-rate sequence fragments which are typical of metagenomic projects. Here we present an approach to detect ORFan protein families in short-read data, and apply it to 937 Mbp (megabase pairs) of sequence from 17 virus-enriched libraries made from human nasopharyngeal aspirates, serum, feces, and cerebrospinal fluid samples. After isolating approximately 450 putative ORFan families from clusters of sequence contigs, we applied RNAcode, a gene finder developed for use on high-quality genome sequences, and calibrated it for error-prone short sequence reads. Additional predictive measures such as sequence complexity and length were then used to rank and filter candidates into a high-quality set of 32 putative novel gene families, only two of which show significant similarity to known genes.
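
The abstract does not say how sequence complexity was measured or which cut-offs were used. As one possible reading, the sketch below scores complexity as the Shannon entropy of the residue composition and filters candidates on entropy and length. The entropy measure, both thresholds, and the function names are assumptions made for illustration, not the manuscript's method.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol composition of seq."""
    if not seq:
        return 0.0
    total = len(seq)
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def passes_filters(seq: str, min_length: int = 60, min_entropy: float = 1.5) -> bool:
    """Keep a candidate only if it is long enough and not low-complexity.

    Both thresholds are hypothetical defaults chosen for this example.
    """
    return len(seq) >= min_length and shannon_entropy(seq) >= min_entropy


# Toy candidates: a homopolymer run, a mixed sequence, and a short fragment.
candidates = ["A" * 100, "ACGT" * 25, "ACGT"]
kept = [seq for seq in candidates if passes_filters(seq)]
```

Ranking could reuse the same score, for example by sorting the kept candidates by entropy and then by length.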

Keyword
metagenomics, novel genes, human virome, gene prediction
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:su:diva-75107 (URN)
Available from: 2012-04-10 Created: 2012-04-05 Last updated: 2012-04-11 Bibliographically approved

Author/editor
Messina, David N.
