Discovery of novel protein families in metagenomic samples
(English) Manuscript (preprint) (Other academic)
Despite the steady rise in gene sequence information, a persistent, significant fraction of genes do not match any previously known sequence. These genes are called ORFans, and metagenomic samples, where DNA is extracted from a mixed population of unknown and often uncultivable species, are a rich source of ORFans. Viral infections cause significant morbidity and mortality, and identifying ORFan viral gene families from human metagenomic samples represents a route to understanding molecular processes that affect human health. Few methods exist for metagenomic gene-finding, and most of them rely on sequence similarity, which cannot be used to detect ORFans. Furthermore, non-similarity-based methods are hard to apply to the complex mixture of short, high-error-rate sequence fragments typical of metagenomic projects. Here we present an approach to detect ORFan protein families in short-read data, and apply it to 937 Mbp (megabase pairs) of sequence from 17 virus-enriched libraries made from human nasopharyngeal aspirates, serum, feces, and cerebrospinal fluid samples. After isolating approximately 450 putative ORFan families from clusters of sequence contigs, we applied RNAcode, a gene finder developed for use on high-quality genome sequences, and calibrated it for error-prone short sequence reads. Additional predictive measures such as sequence complexity and length were then used to rank and filter candidates into a high-quality set of 32 putative novel gene families, only two of which show significant similarity to known genes.
metagenomics, novel genes, human virome, gene prediction
Bioinformatics and Systems Biology
Identifiers: URN: urn:nbn:se:su:diva-75107; OAI: oai:DiVA.org:su-75107; DiVA: diva2:514207