A general framework for predicting the transcriptomic consequences of non-coding variation

Here, we introduce a new computational framework, the promoter-and-enhancer-derived abundance (peaBrain) model, which predicts the mean tissue-specific abundance of all genes and can incorporate the transcriptomic consequences of genotype variation to predict abundance on a subject-by-subject basis. In the associated manuscript, we show how the peaBrain model can be used to investigate and characterise the transcriptomic consequences of both common and rare non-coding variation. The manuscript is available on bioRxiv.

We briefly describe the model and available supplementary materials below. For quick navigation, please use the links in the sidebar.

Stage 1 peaBrain

In Stage 1, we constructed a single model to predict the mean abundance of all genes in any given tissue from the reference genome, optionally supplemented with epigenetic and genomic annotations. We applied this framework to all tissues from the GTEx dataset, constructing three classes of models: (a) using DNA sequence alone (class-A); (b) using DNA sequence plus epigenomic annotations not specific to any tissue or cell type (i.e. non-specific annotations) (class-B); and (c) using DNA sequence combined with both non-specific and tissue-specific annotations (class-C). We have provided all code and data necessary to generate the results for class-A and class-B models. Due to storage constraints, we provide training/test data only for skeletal muscle. Expression data for other tissues is available from GTEx. The original data sources used to train class-C models are detailed in the manuscript.
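To make the difference between the model classes concrete, the following is a minimal sketch of how a class-A style input (DNA sequence alone) could be extended with per-position annotation tracks for a class-B or class-C style input. It is not the peaBrain preprocessing code: the function names `one_hot_encode` and `build_input`, the A/C/G/T channel order, and the treatment of unknown bases are all assumptions made for illustration.

```python
BASES = "ACGT"  # assumed channel order; the released code may use a different one


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows (one row per position).

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def build_input(seq, annotations=None):
    """Return per-position features for a sequence.

    annotations: optional list with one row of numeric track values per
    position (hypothetical layout). If omitted, only the one-hot DNA
    channels are returned, as in a sequence-only model.
    """
    encoded = one_hot_encode(seq)
    if annotations is None:
        return encoded
    if len(annotations) != len(seq):
        raise ValueError("annotations must have one row per sequence position")
    # Append the annotation values to the DNA channels at each position.
    return [dna + list(extra) for dna, extra in zip(encoded, annotations)]
```

Calling `build_input("ACGN")` yields four rows of four channels, while passing a one-track annotation list of the same length yields four rows of five channels.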

Using the Stage 1 class-B models, we generated a non-coding impact metric that captured the impact of each position in the core promoter sequence on the expression of each gene. The peaBrain impact scores for all GTEx tissues have been made available. In the manuscript, we show that this impact score correlates with nucleotide evolutionary constraint and is also predictive of disease-associated variation and allele-specific transcription factor binding. We also highlight how tissue-specific peaBrain scores can be leveraged to pinpoint functional tissues underlying complex traits, outperforming methods that depend on colocalization of eQTL and GWAS signals.
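As a rough illustration of what a per-position impact score means, the sketch below uses generic in silico mutagenesis: each position is substituted with every alternative base, and the largest change in the predicted value is recorded. This is not the metric defined in the manuscript, which should be consulted for the actual definition. `toy_predict` is a hypothetical stand-in for a trained model, and `position_impact` is an illustrative name.

```python
BASES = "ACGT"


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: the GC fraction of seq."""
    return sum(base in "GC" for base in seq) / len(seq)


def position_impact(seq, predict):
    """Score each position by the largest absolute change in the prediction
    over all single-base substitutions at that position.

    Assumes seq is non-empty and predict maps a DNA string to a number.
    """
    baseline = predict(seq)
    scores = []
    for i, ref in enumerate(seq):
        deltas = [
            abs(predict(seq[:i] + alt + seq[i + 1:]) - baseline)
            for alt in BASES
            if alt != ref
        ]
        scores.append(max(deltas))
    return scores
```

With the toy scorer, every position of `"ACGT"` receives the same score, because a single substitution can always flip one base between GC and AT. A trained model would instead produce scores that vary by position.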

Stage 2 peaBrain

In Stage 2, we extended the peaBrain model to incorporate the transcriptomic consequences of individual genotype variation. In the manuscript, we describe the ability of this extended peaBrain model to predict the tissue-specific expression profile of each individual and to identify putatively functional variants within the sequence. Sample code has been provided. Individual-level data is available from GTEx. If you would like access to the complete peaBrain workflow (including the pre-processed individual-level data used in training), please contact us by email so that the necessary approval and ethics requirements can be arranged.
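One simple way to picture incorporating an individual's genotype is to substitute that individual's alternate alleles into the reference sequence before encoding it. The sketch below does this for single-nucleotide variants only. It is an assumption-laden illustration, not the Stage 2 implementation: `apply_variants` and the `(position, ref, alt)` tuple layout are hypothetical, positions are 0-based, and zygosity, phasing, and indels are ignored.

```python
def apply_variants(reference, variants):
    """Return a copy of reference with single-nucleotide variants applied.

    variants: iterable of (position, ref_allele, alt_allele) tuples with
    0-based positions (hypothetical layout, not a GTEx or VCF format).
    """
    seq = list(reference)
    for pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError("only single-nucleotide variants are handled here")
        # Guard against coordinate or strand errors before substituting.
        if seq[pos] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        seq[pos] = alt
    return "".join(seq)
```

For example, `apply_variants("ACGT", [(1, "C", "T")])` returns `"ATGT"`. The reference-allele check is a deliberate design choice: it fails loudly on coordinate errors rather than silently producing a wrong sequence.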

The weights for all gene-level models will be uploaded here soon.