Background: There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance. Results: A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization. Conclusions: The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.
Background: RNA editing by adenosine to inosine deamination is a widespread phenomenon, particularly frequent in the human transcriptome, largely due to the presence of inverted Alu repeats and their ability to form double-stranded structures - a requisite for ADAR editing. While several hundred thousand editing sites have been identified within these primate-specific repeats, the function of Alu-editing has yet to be elucidated. Results: We show that inverted Alu repeats, expressed in the primate brain, can induce site-selective editing in cis on sites located several hundred nucleotides from the Alu elements. Furthermore, a computational analysis, based on available RNA-seq data, finds that site-selective editing occurs significantly closer to edited Alu elements than expected. These targets are poorly edited upon deletion of the editing inducers, as well as in homologous transcripts from organisms lacking Alus. Sequences surrounding sites near edited Alus in UTRs have been subjected to a lesser extent of evolutionary selection than those far from edited Alus, indicating that their editing generally depends on cis-acting Alus. Interestingly, we find an enrichment of primate-specific editing within encoded sequence or the UTRs of zinc finger-containing transcription factors. Conclusions: We propose a model whereby primate-specific editing is induced by adjacent Alu elements that function as recruitment elements for the ADAR editing enzymes. The enrichment of site-selective editing with potentially functional consequences on the expression of transcription factors indicates that editing contributes more profoundly to the transcriptomic regulation and repertoire in primates than previously thought.
Background: Adenosine to inosine (A-to-I) RNA editing has been shown to be an essential event that plays a significant role in neuronal function, as well as innate immunity, in mammals. It requires a structure that is largely double-stranded for catalysis but little is known about what determines editing efficiency and specificity in vivo. We have previously shown that some editing sites require adjacent long stem loop structures acting as editing inducer elements (EIEs) for efficient editing. Results: The glutamate receptor subunit A2 is edited at the Q/R site in almost 100% of all transcripts. We show that efficient editing at the Q/R site requires an EIE in the downstream intron, separated by an internal loop. Also, other efficiently edited sites are flanked by conserved, highly structured EIEs and we propose that this is a general requisite for efficient editing, while sites with low levels of editing lack EIEs. This phenomenon is not limited to mRNA, as non-coding primary miRNAs also use EIEs to recruit ADAR to specific sites. Conclusions: We propose a model where two regions of dsRNA are required for efficient editing: first, an RNA stem that recruits ADAR and increases the local concentration of the enzyme, then a shorter, less stable duplex that is ideal for efficient and specific catalysis. This discovery changes the way we define and determine a substrate for A-to-I editing. This will be important in the discovery of novel editing sites, as well as explaining cases of altered editing in relation to disease.
Background: The lifelong accumulation of somatic mutations underlies age-related phenotypes and cancer. Mutagenic forces are thought to shape the genome of aging cells in a tissue-specific way. Whole genome analyses of somatic mutation patterns, based on both types and genomic distribution of variants, can shed light on specific processes active in different human tissues and their effect on the transition to cancer. Results: To analyze somatic mutation patterns, we compile a comprehensive genetic atlas of somatic mutations in healthy human cells. High-confidence variants are obtained from newly generated and publicly available whole genome DNA sequencing data from single non-cancer cells, clonally expanded in vitro. To enable a well-controlled comparison of different cell types, we obtain single genome data (92% mean coverage) from multi-organ biopsies from the same donors. These data show multiple cell types that are protected from mutagens and display a stereotyped mutation profile, despite their origin from different tissues. Conversely, the same tissue harbors cells with distinct mutation profiles associated with different differentiation states. Analyses of mutation rate in the coding and non-coding portions of the genome identify a cell type bearing a unique mutation pattern characterized by mutation enrichment in active chromatin, regulatory, and transcribed regions. Conclusions: Our analysis of normal cells from healthy donors identifies a somatic mutation landscape that enhances the risk of tumor transformation in a specific cell population from the kidney proximal tubule. This unique pattern is characterized by a high rate of mutation accumulation during adult life and specific targeting of expressed genes and regulatory regions.
Background: Formation of tissue-specific transcriptional programs underlies multicellular development, including dorsoventral (DV) patterning of the Drosophila embryo. This involves interactions between transcriptional enhancers and promoters in a chromatin context, but how the chromatin landscape influences transcription is not fully understood. Results: Here we comprehensively resolve differential transcriptional and chromatin states during Drosophila DV patterning. We find that RNA Polymerase II pausing is established at DV promoters prior to zygotic genome activation (ZGA), that pausing persists irrespective of cell fate, but that release into productive elongation is tightly regulated and accompanied by tissue-specific P-TEFb recruitment. DV enhancers acquire distinct tissue-specific chromatin states through CBP-mediated histone acetylation that predict the transcriptional output of target genes, whereas promoter states are more tissue-invariant. Transcriptome-wide inference of burst kinetics in different cell types revealed that while DV genes are generally characterized by a high burst size, either burst size or frequency can differ between tissues. Conclusions: The data suggest that pausing is established by pioneer transcription factors prior to ZGA and that release from pausing is imparted by enhancer chromatin state to regulate bursting in a tissue-specific manner in the early embryo. Our results uncover how developmental patterning is orchestrated by tissue-specific bursts of transcription from Pol II primed promoters in response to enhancer regulatory cues.
Background: Conventional wisdom holds that, owing to the dominance of features such as chromatin level control, the expression of a gene cannot be readily predicted from knowledge of promoter architecture. This is reflected, for example, in a weak or absent correlation between promoter divergence and expression divergence between paralogs. However, an inability to predict may reflect an inability to accurately measure or employment of the wrong parameters. Here we address this issue through integration of two exceptional resources: ENCODE data on transcription factor binding and the FANTOM5 high-resolution expression atlas. Results: Consistent with the notion that in eukaryotes most transcription factors are activating, the number of transcription factors binding a promoter is a strong predictor of expression breadth. In addition, evolutionarily young duplicates have fewer transcription factor binders and narrower expression. Nonetheless, we find several binders and cooperative sets that are disproportionately associated with broad expression, indicating that models more complex than simple correlations should hold more predictive power. Indeed, a machine learning approach improves fit to the data compared with a simple correlation. Machine learning could at best moderately predict the tissue of expression of tissue-specific genes. Conclusions: We find robust evidence that some expression parameters and paralog expression divergence are strongly predictable with knowledge of transcription factor binding repertoire. While some cooperative complexes can be identified, consistent with the notion that most eukaryotic transcription factors are activating, a simple predictor, the number of binding transcription factors found on a promoter, is a robust predictor of expression breadth.
We present here miRTrace, the first algorithm to trace microRNA sequencing data back to their taxonomic origins. This is a challenge with profound implications for forensics, parasitology, food control, and research settings where cross-contamination can compromise results. miRTrace accurately (> 99%) assigns real and simulated data to 14 important animal and plant groups, sensitively detects parasitic infection in mammals, and discovers the primate origin of single cells. Applying our algorithm to over 700 public datasets, we find evidence that over 7% are cross-contaminated and present a novel solution to clean these computationally, even after sequencing has occurred.
Background: Recent work has demonstrated that three-dimensional genome organization is directly affected by changes in the levels of nuclear cytoskeletal proteins such as β-actin. The mechanisms which translate changes in 3D genome structure into changes in transcription, however, are not fully understood. Here, we use a comprehensive genomic analysis of cells lacking nuclear β-actin to investigate the mechanistic links between compartment organization, enhancer activity, and gene expression.
Results: Using HiC-Seq, ATAC-Seq, and RNA-Seq, we first demonstrate that transcriptional and chromatin accessibility changes observed upon β-actin loss are highly enriched in compartment-switching regions. Accessibility changes within compartment switching genes, however, are mainly observed in non-promoter regions which potentially represent distal regulatory elements. Our results also show that β-actin loss induces widespread accumulation of the enhancer-specific epigenetic mark H3K27ac. Using the ABC model of enhancer annotation, we then establish that these epigenetic changes have a direct impact on enhancer activity and underlie transcriptional changes observed upon compartment switching. A complementary analysis of fibroblasts undergoing reprogramming into pluripotent stem cells further confirms that this relationship between compartment switching and enhancer-dependent transcriptional change is not specific to β-actin knockout cells but represents a general mechanism linking compartment-level genome organization to gene expression.
Conclusions: We demonstrate that enhancer-dependent transcriptional regulation plays a crucial role in driving gene expression changes observed upon compartment-switching. Our results also reveal a novel function of nuclear β-actin in regulating enhancer function by influencing H3K27 acetylation levels.
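For readers unfamiliar with the ABC (activity-by-contact) model used above, the following is a minimal sketch of how such a score is commonly formulated: each candidate element's activity is multiplied by its contact frequency with the gene, then normalized by the sum over all candidate elements. The function name `abc_scores`, the input layout, and the toy values are illustrative assumptions and are not taken from the study; the activity and contact measures actually used by the authors are not specified here.

```python
from typing import Dict, Tuple


def abc_scores(
    activity: Dict[str, float],
    contact: Dict[Tuple[str, str], float],
    gene: str,
) -> Dict[str, float]:
    """Return a hypothetical ABC score for every candidate element of one gene.

    activity: element -> activity estimate (e.g. derived from accessibility
              and H3K27ac signal; the exact definition is an assumption).
    contact:  (element, gene) -> contact frequency (e.g. from Hi-C).
    """
    # Activity multiplied by contact for each element; missing contacts count as 0.
    weighted = {e: a * contact.get((e, gene), 0.0) for e, a in activity.items()}
    total = sum(weighted.values())
    if total == 0:
        # No element has both activity and contact with this gene.
        return {e: 0.0 for e in weighted}
    # Normalize so the scores of all candidate elements sum to 1.
    return {e: w / total for e, w in weighted.items()}


# Toy example with made-up numbers.
toy_activity = {"e1": 2.0, "e2": 1.0}
toy_contact = {("e1", "g"): 0.5, ("e2", "g"): 1.0}
toy_scores = abc_scores(toy_activity, toy_contact, "g")
```

In this toy case the more active element contacts the gene less often, so both elements receive the same share of the predicted regulatory input.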
Analysis of microbial data from archaeological samples is a growing field with great potential for understanding ancient environments, lifestyles, and diseases. However, high error rates have been a challenge in ancient metagenomics, and the availability of computational frameworks that meet the demands of the field is limited. Here, we propose aMeta, an accurate metagenomic profiling workflow for ancient DNA designed to minimize the number of false discoveries and computer memory requirements. Using simulated data, we benchmark aMeta against a current state-of-the-art workflow and demonstrate its superiority in microbial detection and authentication, as well as substantially lower usage of computer memory.
Read alignment is often the computational bottleneck in analyses. Recently, several advances have been made on seeding methods for fast sequence comparison. We combine two such methods, syncmers and strobemers, in a novel seeding approach for constructing dynamic-sized fuzzy seeds and implement the method in a short-read aligner, strobealign. The seeding is fast to construct and effectively reduces repetitiveness in the seeding step, as shown using a novel metric E-hits. strobealign is several times faster than traditional aligners at similar and sometimes higher accuracy while being both faster and more accurate than more recently proposed aligners for short reads of lengths 150nt and longer. Availability: https://github.com/ksahlin/strobealign
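To illustrate the idea of combining syncmers with strobemers, here is a simplified sketch, not strobealign's actual implementation. It selects open syncmers (k-mers whose smallest s-mer, under a hash ordering, starts at a fixed offset t) and then links each syncmer to one pseudo-randomly chosen downstream syncmer, so that the resulting seed spans a variable distance. The parameter values, the CRC32 hash, the XOR linking rule, and the function names are all assumptions chosen for brevity.

```python
import zlib
from typing import List, Tuple


def h(s: str) -> int:
    """Deterministic stand-in hash (CRC32); a real tool would use a faster hash."""
    return zlib.crc32(s.encode())


def open_syncmers(seq: str, k: int = 8, s: int = 4, t: int = 2) -> List[Tuple[int, str]]:
    """Return (position, k-mer) for k-mers whose minimal-hash s-mer starts at offset t."""
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smer_hashes = [h(kmer[j:j + s]) for j in range(k - s + 1)]
        # Leftmost minimum decides; the k-mer is kept only if it sits at offset t.
        if smer_hashes.index(min(smer_hashes)) == t:
            out.append((i, kmer))
    return out


def link_strobes(
    syncmers: List[Tuple[int, str]], w_min: int = 1, w_max: int = 3
) -> List[Tuple[int, int, str]]:
    """Pair each syncmer with one of the next w_min..w_max syncmers.

    The partner minimizes a combined hash, so the choice depends on both
    strobes and the distance between them varies from seed to seed.
    """
    seeds = []
    for idx, (pos1, s1) in enumerate(syncmers):
        window = syncmers[idx + w_min: idx + w_max + 1]
        if not window:
            break  # no downstream syncmers left to link to
        pos2, s2 = min(window, key=lambda c: h(s1) ^ h(c[1]))
        seeds.append((pos1, pos2, s1 + s2))
    return seeds


# Arbitrary example sequence (hypothetical).
example_seq = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCCGATCGGATTACA"
example_syncmers = open_syncmers(example_seq)
example_seeds = link_strobes(example_syncmers)
```

Because both strobes are themselves syncmers, a substitution or indel between them can leave the seed intact, which is the sense in which such seeds are "fuzzy".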
It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).
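As a rough illustration of the seed-and-chain idea, the sketch below takes seed matches ("anchors") given as (reference position, query position) pairs and finds a highest-scoring colinear chain by quadratic dynamic programming. The scoring (matched bases gained minus a linear penalty on the difference between reference and query gaps), the default parameters, and the name `best_chain` are simplifying assumptions; actual mappers use their own gap costs and faster heuristics.

```python
from typing import List, Tuple

Anchor = Tuple[int, int]  # (reference position, query position)


def best_chain(
    anchors: List[Anchor], seed_len: int = 15, max_gap: int = 5000
) -> Tuple[int, List[Anchor]]:
    """Return (score, chain) for the best colinear chain of anchors."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0, []
    score = [seed_len] * n  # a chain consisting of one anchor scores seed_len
    prev = [-1] * n         # back-pointers for traceback
    for i in range(n):
        ri, qi = anchors[i]
        for j in range(i):
            rj, qj = anchors[j]
            dr, dq = ri - rj, qi - qj
            # Anchor j must precede anchor i on both sequences, within max_gap.
            if dr <= 0 or dq <= 0 or max(dr, dq) > max_gap:
                continue
            gain = min(dr, dq, seed_len)  # newly covered bases
            penalty = abs(dr - dq)        # simple linear gap cost (assumption)
            cand = score[j] + gain - penalty
            if cand > score[i]:
                score[i], prev[i] = cand, j
    end = max(range(n), key=score.__getitem__)
    best_score = score[end]
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return best_score, chain[::-1]


# Three colinear anchors plus one off-diagonal anchor (made-up coordinates).
toy_anchors = [(100, 10), (120, 30), (140, 50), (500, 20)]
toy_score, toy_chain = best_chain(toy_anchors)
```

Here the three anchors on the same diagonal form the chain, and the off-diagonal anchor is left out because joining it would incur a large gap penalty.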