High quality analysis of genomic data

Bioinformatics overview

Our bioinformaticians are highly experienced in handling and analysing next-generation sequencing data and support researchers across a wide range of disciplines both within Oxford and beyond.

Past and present members of the Bioinformatics team have played key roles in high profile initiatives including WGS500, the 1000 Genomes Project, the HapMap Project, and most recently the first sequencing of human genomes using Oxford Nanopore’s MinION™ device.

Who are we?

 

The Bioinformatics team maintains and runs the computational pipelines that process raw sequence data and perform quality checks on all data generated within the Oxford Genomics Centre. The team also performs high quality analyses for the following applications:

  • Genomics (whole-genome, exome and targeted sequencing)
  • Transcriptomics (RNA-Seq, miRNA-Seq, microarrays)
  • Epigenetics (ChIP-Seq, ATAC-Seq, methylation)
  • Single cell genomics (Fluidigm® C1™, FACS)

Our services include:

  • Advice on experimental design
  • Initial bioinformatics processing steps
  • Data quality checks
  • Extensive informatics support for tracking, processing and managing large volumes of raw sequence data
  • Training and advice on the use of a variety of bioinformatics tools and analyses (both individually and through workshops)
  • Collaborating with research groups to perform downstream analyses*
  • Developing software tools to aid scientists in the analysis of genomic datasets
  • Evaluating data from trials of cutting-edge technologies
  • Assistance with submitting data to public repositories

*In-depth downstream analysis is primarily for groups based at the Wellcome Trust Centre for Human Genetics, but please contact us for more information on the support available for your project.


Standard data processing and quality control

All data generated within the Oxford Genomics Centre undergoes rigorous quality control to ensure that researchers receive the best possible data and to identify any issues that arise. Data is processed through our primary pipeline to generate raw sequence data (FASTQ files) and, in most cases, aligned reads (BAM files) for each sequenced sample. Associated quality control (QC) metrics are also provided, and the data is made available via an FTP link.
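
As an illustration of the kind of per-sample metric a primary QC report might contain, the following minimal Python sketch reads FASTQ records and counts reads whose mean base quality falls below a threshold. It is not the Centre's pipeline: the function names, the Phred+33 encoding and the cutoff of 20 are assumptions made for this example.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return mean(ord(c) - offset for c in qual)


def summarise(handle, min_mean_quality=20.0):
    """Count reads, bases and reads whose mean quality is below the (assumed) cutoff."""
    n_reads = n_bases = n_low = 0
    for _read_id, seq, qual in read_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
        if mean_phred(qual) < min_mean_quality:
            n_low += 1
    return {"reads": n_reads, "bases": n_bases, "low_quality_reads": n_low}


# Two made-up reads: one with high-quality bases ('I' = Q40), one with Q0 ('!').
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGTAC\n+\n!!!!!!\n"
)
print(summarise(example))
```

In practice, dedicated QC tools report many more metrics than this; the sketch only shows how a quality string maps to Phred scores.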

Under our data storage policy, raw data is kept in archive storage for up to two years, although we strongly recommend prompt download and storage of all data provided. Our bioinformatics support is intended to enable individual research groups to take advantage of the latest genomics techniques without needing extensive computing or bioinformatics resources of their own. Additional bioinformatics steps that may be performed on request (where resources permit) are detailed below for each application.

High-performance computing cluster at the WTCHG.


Analysis applications

Figure shows the substitution of a 4.5 kb portion of the mm9 genomic reference with a transgene.

Whole-Genome, Exome and Targeted Sequencing Analysis

Whole-genome sequencing (WGS) determines the complete DNA sequence of an organism’s genome. This type of sequencing is mainly used in projects that aim to detect SNPs, indels and structural variants in a sample.

Exome sequencing is a technique to selectively sequence the coding regions of the genome as an effective alternative to whole-genome sequencing. This type of sequencing is most commonly used in projects aimed at detecting variants related to disease-causing protein structural and functional changes.

With targeted resequencing, a subset of genes or regions of interest is isolated and sequenced at high resolution. Pre-defined or customised gene panels enable researchers to focus on specific areas of interest, saving time and sequencing costs.

The following files can be provided with whole-genome, exome or targeted sequencing projects:

  • Primary QC report (general QC metrics on sequencing quality for each sequenced sample)
  • FASTQ (raw sequence data) and BAM (alignment) files for each sample

Bespoke downstream DNA sequencing analyses*:

  • Annotated VCF files (list of identified variants, annotated against a number of databases and with their potential effects on genes, transcripts, proteins and regulatory regions)
  • Secondary QC report (a set of QC metrics including coverage analysis and variant statistics for each sample)
  • Variant filtering, based on specific cutoffs and user-defined criteria
  • CNV detection analysis
  • Customised downstream variant analysis

*Please note that bespoke analyses are subject to availability and resources. Please contact the Oxford Genomics Centre for more information.
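
To make "variant filtering based on specific cutoffs" concrete, here is a minimal Python sketch that keeps VCF records whose QUAL value and INFO/DP depth meet user-supplied thresholds. The cutoff values, function names and example records are hypothetical and chosen only for illustration; real projects would agree their own criteria and typically use an established VCF toolkit.

```python
def parse_info(info_field):
    """Turn 'DP=35;AF=0.5;DB' into {'DP': '35', 'AF': '0.5', 'DB': True}."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flags have no '=' and no value
    return info


def passes(line, min_qual=30.0, min_depth=10):
    """Return True if a VCF data line meets the (assumed) QUAL and DP cutoffs."""
    fields = line.rstrip("\n").split("\t")
    qual = fields[5]  # column 6 of a VCF record is QUAL
    if qual == "." or float(qual) < min_qual:
        return False
    depth = parse_info(fields[7]).get("DP")  # column 8 is INFO
    return depth is not None and depth is not True and int(depth) >= min_depth


def filter_vcf(lines, **cutoffs):
    """Yield header lines unchanged plus the data lines that pass the cutoffs."""
    for line in lines:
        if line.startswith("#") or passes(line, **cutoffs):
            yield line


# Made-up records: only the first has both QUAL >= 30 and DP >= 10.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=35",
    "chr1\t2000\t.\tC\tT\t12\tPASS\tDP=40",
    "chr2\t3000\t.\tG\tA\t60\tPASS\tDP=4",
]
kept = list(filter_vcf(vcf))
for line in kept:
    print(line)
```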

Figure shows alternative splicing occurring in exons 2 and 3 of EIF2AK1 between normoxic and hypoxic conditions.

RNA-Seq Analysis

RNA-Seq enables the profiling of the entire transcriptome in any organism. This type of sequencing is commonly used in projects that aim to either quantify the levels of gene expression, detect differential expression or detect alternative splicing in a sample. RNA-Seq can be performed either on bulk samples or on single cells.

The following files can be provided with RNA-Seq projects:

  • Primary QC report (general QC metrics on sequencing quality for each sequenced sample)
  • FASTQ (raw sequence data) and BAM (alignment) files for each sample
  • Count matrix file (raw and normalised counts for annotated genes)
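
The relationship between raw and normalised counts can be sketched in Python as follows. The source does not state which normalisation method is used, so this example assumes simple counts-per-million (CPM) scaling; the function name, gene names and sample names are hypothetical.

```python
def counts_per_million(raw_counts):
    """Scale raw counts so that each sample's counts sum to one million.

    raw_counts maps gene -> {sample -> raw count}; the result has the same shape.
    CPM is used here as an assumed, illustrative normalisation only.
    """
    samples = sorted({s for per_gene in raw_counts.values() for s in per_gene})
    library_size = {
        s: sum(per_gene.get(s, 0) for per_gene in raw_counts.values())
        for s in samples
    }
    return {
        gene: {
            s: (per_gene.get(s, 0) * 1e6 / library_size[s]) if library_size[s] else 0.0
            for s in samples
        }
        for gene, per_gene in raw_counts.items()
    }


# Made-up count matrix with two genes and two samples.
raw = {
    "GeneA": {"sample1": 10, "sample2": 0},
    "GeneB": {"sample1": 30, "sample2": 50},
}
cpm = counts_per_million(raw)
for gene, values in cpm.items():
    print(gene, values)
```

CPM corrects only for sequencing depth; methods used for differential expression usually apply more elaborate normalisation.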

Bespoke downstream RNA-Seq analyses*:

  • Secondary QC report (a set of QC metrics to assess the quality of data and exploratory clustering plots)
  • Differential expression analysis
  • Alternative splicing analysis
  • Gene ontology and functional pathway enrichment analysis
  • Customised RNA-Seq downstream analyses

*Please note that bespoke analyses are subject to availability and resources. Please contact the Oxford Genomics Centre for more information.
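
Gene ontology and pathway enrichment analyses commonly rest on an over-representation test. The Python sketch below computes a one-sided hypergeometric p-value for the overlap between a gene list and a single annotation term. It is a generic illustration under stated assumptions, not a description of the tools used by the Centre; the function and parameter names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_in_term carry the annotation."""
    total = comb(n_background, n_selected)
    upper = min(n_in_term, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 10 background genes, 5 annotated with the term,
# 5 differentially expressed genes, all 5 of which carry the annotation.
p = enrichment_p_value(n_background=10, n_in_term=5, n_selected=5, n_overlap=5)
print(f"p = {p:.5f}")
```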

Figure shows transcription factor binding at the mouse mm9 PLXND1 gene.

ChIP-Seq Analysis

ChIP-Seq combines chromatin immunoprecipitation (ChIP) with sequencing to identify the binding sites of DNA-associated proteins (e.g. transcription factors) and histone modifications genome-wide.

The following files can be provided with ChIP-Seq projects:

  • Primary QC report (general QC metrics on sequencing quality for each sequenced sample)
  • FASTQ (raw sequence data) and BAM (alignment) files for each sample
  • Peak calling output files (list of ChIP-enriched regions identified)

Bespoke downstream ChIP-Seq analyses*:

  • Secondary QC report (a set of QC metrics to assess the enrichment of the IP and the reproducibility of the experiment)
  • Peaks annotated with nearest gene information
  • Analysis of peak distribution with respect to genomic features
  • Differential binding analysis
  • Gene ontology and functional pathway enrichment analysis
  • Known and de novo motif search
  • Customised ChIP-Seq downstream analyses

*Please note that bespoke analyses are subject to availability and resources. Please contact the Oxford Genomics Centre for more information.
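
Annotating peaks with their nearest gene can be sketched in Python as a binary search over sorted transcription start sites (TSSs). This is a simplified illustration: it assumes a single chromosome, ignores strand, and uses hypothetical gene names and coordinates.

```python
from bisect import bisect_left


def nearest_gene(peak_summit, tss_positions, gene_names):
    """Return (gene, signed distance from summit to TSS) for the closest TSS.

    tss_positions must be sorted in ascending order and be parallel to
    gene_names; an empty list raises ValueError.
    """
    i = bisect_left(tss_positions, peak_summit)
    # Only the TSS just before and just after the summit can be the nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    best = min(candidates, key=lambda j: abs(tss_positions[j] - peak_summit))
    return gene_names[best], tss_positions[best] - peak_summit


# Made-up TSS coordinates on one chromosome.
tss = [1000, 5000, 9000]
genes = ["GeneA", "GeneB", "GeneC"]

for summit in (100, 5600, 8000):
    print(summit, nearest_gene(summit, tss, genes))
```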

Flowchart of miRNA-Seq data analysis. Source: Wikimedia Commons.

miRNA-Seq Analysis

MicroRNA (miRNA) profiling and discovery is a popular field in the biological research community because these short RNA sequences play an important role in gene regulation.

The following files can be provided with miRNA-Seq projects:

  • Primary QC report (general QC metrics on sequencing quality for each sequenced sample)
  • FASTQ (raw sequence data) and BAM (alignment) files for each sample
  • Count matrix file (raw and normalised counts for annotated miRNAs)

Bespoke downstream miRNA-Seq analyses*:

  • Differential expression analysis of miRNAs
  • Target gene identification
  • Customised miRNA-Seq downstream analyses

*Please note that bespoke analyses are subject to availability and resources. Please contact the Oxford Genomics Centre for more information.

Examples of different QC metrics from in-house ATAC-Seq data.

ATAC-Seq Analysis

ATAC-Seq is a relatively new technique for studying chromatin accessibility that is faster than DNase-Seq. It identifies accessible regions of DNA (comparable to DNase I hypersensitive sites), which can be used to determine the “footprints” of DNA-binding proteins and to infer nucleosome positions. ATAC-Seq can be performed either on bulk samples or on single cells.

The following files can be provided with ATAC-Seq projects:

  • Primary QC report (general QC metrics on sequencing quality for each sequenced sample)
  • FASTQ (raw sequence data) and BAM (alignment) files for each sample

Bespoke downstream ATAC-Seq analyses*:

  • Secondary QC report (a set of QC metrics including fragment size distribution, reproducibility of the experiment and genomic features enrichment)
  • Full peak annotation (including nearest gene annotation and overlap with ENCODE’s TFs, DNase I and other tracks of interest)
  • Customised ATAC-Seq downstream analyses

*Please note that bespoke analyses are subject to availability and resources. Please contact the Oxford Genomics Centre for more information.
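
The fragment size distribution mentioned in the secondary QC report can be sketched in Python as a binned histogram of fragment lengths. The input is assumed to be a list of template lengths already extracted from paired-end alignments; the 150 bp boundary for "short" fragments and the 50 bp bin width are illustrative assumptions, not the Centre's settings.

```python
from collections import Counter


def fragment_size_summary(fragment_lengths, short_max=150, bin_width=50):
    """Summarise fragment lengths: binned histogram and fraction of short fragments.

    Zero lengths (e.g. unpaired reads) are skipped and negative values are
    converted to absolute lengths. The cutoffs are assumed, not authoritative.
    """
    lengths = [abs(n) for n in fragment_lengths if n != 0]
    if not lengths:
        raise ValueError("no usable fragment lengths")
    histogram = Counter((n // bin_width) * bin_width for n in lengths)
    n_short = sum(1 for n in lengths if n < short_max)
    return {
        "fragments": len(lengths),
        "fraction_short": n_short / len(lengths),
        "histogram": dict(sorted(histogram.items())),
    }


# Made-up template lengths; the sign depends on read orientation, 0 means unpaired.
summary = fragment_size_summary([60, 75, -80, 190, 210, 0, 400])
print(summary)
```

A library with good signal typically shows many short fragments plus periodic peaks at nucleosome-sized intervals, which is what such a histogram is inspected for.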

Figure shows differential expression of Foxp2 target genes in E16 developing mouse ganglionic eminences.

Microarray Analysis

Microarray technologies remain a popular alternative to sequencing in many situations. The Oxford Genomics Centre offers gene expression (including exon), methylation and SNP genotyping arrays.

The following files can be provided with microarray projects:

  • Primary QC report (a set of QC metrics to assess the quality of biological and technical aspects of each sample)
  • Raw data files (e.g. CEL files)
  • Probe signal intensity files

Bespoke downstream microarray analyses*:

  • Secondary QC report (a set of QC metrics to assess the quality of data and exploratory clustering plots)
  • Differential gene expression
  • Functional enrichment and pathway analysis
  • Analysis of custom arrays

*Please note that bespoke analyses are subject to availability and resources. Please contact the Oxford Genomics Centre for more information.

 


 

Platypus

Platypus is a tool designed for efficient and accurate variant detection in high-throughput sequencing data. Developed by: Dr Andrew Rimmer, Iain Mathieson, Dr Gerton Lunter.

More about Platypus

Stampy

Stampy is a powerful tool designed for ultra-sensitive and specific mapping of short read sequences onto a reference genome. Developed by: Dr Gerton Lunter.

More about Stampy

GREVE

The Genomic Recurrent Event ViEwer (GREVE) software tool creates genome-wide or per chromosome plots and summaries of a user-generated list of events, typically Copy Number Variations (CNVs), and genes. Developed by: Dr Jean-Baptiste Cazier.

More about GREVE

HAPPY

HAPPY is designed to map Quantitative Trait Loci (QTL) in outbred populations descended from crosses between inbred lines. Developed by: Prof Richard Mott.

More about HAPPY

BrowseVCF

BrowseVCF is a web-based application and workflow to quickly prioritise disease-causative variants in VCF files. Developed by: Dr Silvia Salatino and Dr Varun Ramraj.

More about BrowseVCF

Opossum

Opossum is a tool to pre-process RNA-seq reads prior to variant calling. Developed by: Dr Laura Oikkonen and Dr Stefano Lise.

More about Opossum