Primary Data QC


Ensuring high data quality through a rigorous QC at the OGC

The quality of the sequencing data generated at the Oxford Genomics Centre (OGC) is analysed using a primary Quality Control (QC) pipeline developed in-house. Depending on the library type, a number of informative metrics are generated and provided to researchers as an HTML file, which can be opened with any web browser, so that the success of a given sequencing run can be assessed easily. Data download links are provided along with the report. At the sample level, this primary QC can also be compared to, and validated against, material QC and other laboratory observations collected during critical steps in the library preparation process. All the data are manually evaluated by the OGC team against known standards before being passed for release or downstream analysis.

Detailed explanations of each metric are available at any time by clicking on the “Metrics Info” and/or “Plots Info” buttons embedded in the HTML report. Additionally, a number of interactive tools are provided next to each graph to zoom, drag and visually inspect the data, and to display specific values when hovering with the mouse.

 

Lane QC Summary and QC Statistics

This section shows information regarding the sequencing run, followed by a table with a set of metrics that provide an overview of the lane (Figure 1). As shown in the blue table in Figure 1, paired-end libraries are divided into read 1 and read 2 for an in-depth analysis.

Figure 1: “Lane QC Summary” section

 

The next section has a table that also splits the metrics by read 1 and read 2, making it easy to identify either known or unexpected biases (Figure 2).

Figure 2: “QC Statistics” section showing metrics for each read in lane 3 (3.1 and 3.2, read 1 and read 2, respectively)

QC Plots

Figure 3 shows the primary QC graphs using merged data from both read 1 and read 2 for a Whole-Genome Sequencing (WGS) project. The first graph (“%GC over reads and genome”) shows the GC content of the whole reference genome (dotted line) and of the sequenced reads (continuous line). The similarity between these two distributions indicates a good and heterogeneous collection of reads throughout the reference genome. As a result, the coverage is relatively constant across the genome’s GC content, which in this case is best seen on a log10 scale.

Figure 3: Example of the main QC graphs generated for a WGS project
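As a rough illustration of the comparison behind the “%GC over reads and genome” graph, the sketch below bins the GC percentage of a set of reads and of fixed-size reference windows so that the two distributions can be compared. It is not the OGC pipeline's code: the function names, the window size and the bin width are all assumptions made for this example.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """Return the GC percentage of a sequence, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def gc_distribution(seqs, bin_width: int = 10) -> dict:
    """Fraction of sequences in each GC% bin, keyed by the lower edge of the bin."""
    counts = Counter()
    n = 0
    for s in seqs:
        # Sequences with exactly 100% GC are placed in the top bin.
        lower = min(int(gc_percent(s) // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
        n += 1
    return {b: counts[b] / n for b in sorted(counts)} if n else {}


def windows(genome: str, size: int) -> list:
    """Split a reference sequence into non-overlapping windows of `size` bases."""
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]


# Hypothetical toy data: compare reads against windows of a tiny "reference".
reads = ["ACGT", "GGCC", "ATAT", "ACGT"]
reference = "ACGTACGTAC"
read_dist = gc_distribution(reads)
ref_dist = gc_distribution(windows(reference, 4))
```

Plotting `read_dist` and `ref_dist` on the same axes would give a simplified analogue of the continuous and dotted lines described above.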

 

Furthermore, detailed graphs for each sequencing-by-synthesis cycle (corresponding to a base of the nucleic acid being processed) are also reported (Figure 4).

Figure 4: Example of read-specific graphs for a paired-end project; the bottom row suggests that (as expected) the quality of read 2 is lower than that of read 1, particularly towards the end of the sequence
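To make the idea of a per-cycle metric concrete, the following minimal sketch computes the mean base quality at each cycle from FASTQ quality strings. It assumes Phred+33 encoding and plain quality strings as input; the function name is hypothetical, and the report's own graphs may be computed differently.

```python
def mean_quality_per_cycle(quality_strings, offset: int = 33) -> list:
    """Mean Phred score at each cycle (position) across a set of reads.

    Assumes Phred+`offset` ASCII encoding. Reads may differ in length;
    each cycle is averaged over the reads that reach it.
    """
    sums, counts = [], []
    for qual in quality_strings:
        for cycle, ch in enumerate(qual):
            if cycle == len(sums):
                # First read that reaches this cycle: extend the accumulators.
                sums.append(0)
                counts.append(0)
            sums[cycle] += ord(ch) - offset
            counts[cycle] += 1
    return [s / c for s, c in zip(sums, counts)]


# Hypothetical quality strings: 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
read1_quals = ["III", "III"]
read2_quals = ["II5", "I5"]
r1 = mean_quality_per_cycle(read1_quals)
r2 = mean_quality_per_cycle(read2_quals)
```

Running the function separately on read 1 and read 2, as here, gives the kind of side-by-side comparison in which a drop in quality towards the end of read 2 becomes visible.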

Multiplex QC Statistics and Plots

A table presents the metrics for each sample (Figure 5), and graphs of the same metrics make outliers easy to locate (Figure 6). Additionally, clicking on the “Open Sample Level Info” button shows the multiplex barcodes, as well as other sample-related information.

Figure 5: Example of the sample-specific metrics

 

Figure 6: Example of the sample-specific graphs to visualise outliers
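In the report, outliers are located visually from the graphs. For readers who want to screen a sample-level metric programmatically, one common approach is a modified z-score based on the median absolute deviation (MAD), sketched below. The metric values, sample names and the 3.5 threshold are assumptions chosen for illustration and are not taken from the OGC pipeline.

```python
from statistics import median


def flag_outliers(values: dict, threshold: float = 3.5) -> list:
    """Return the names of samples whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * (x - median) / MAD, which is less
    sensitive to the outliers themselves than a mean/standard-deviation score.
    """
    if not values:
        return []
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # All samples (or most of them) are identical; nothing can be ranked.
        return []
    return sorted(
        name for name, v in values.items()
        if abs(0.6745 * (v - med) / mad) > threshold
    )


# Hypothetical per-sample yields (arbitrary units) for one multiplexed lane.
yields = {"S1": 10.0, "S2": 11.0, "S3": 9.0, "S4": 10.5, "S5": 30.0}
outliers = flag_outliers(yields)
```

A flagged sample is only a candidate for closer inspection; whether it is a genuine problem still depends on the laboratory context.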

Tile Metrics

Finally, tile-specific metrics are shown for an in-depth quality check (Figure 7). The table of metrics is hidden by default and can be shown by clicking on “Tile Metrics”; alternatively, the graphs give an overview of the highlighted metrics.

Figure 7: “Tile Metrics” section with hidden table

 

RNA-Seq Specific Metrics

Due to the intrinsic diversity of the sequencing applications processed at the OGC, it became imperative to add metrics specific to each data type, rather than applying a standard set of QC metrics to all projects. For example, an RNA-Seq PolyA project (Figure 8) will show a characteristic feature counts histogram, which is a gene-level quantification of the reads sequenced per sample, stratified by feature type. This graph, complemented with other metrics presented later, makes it possible to identify any unexpected read distribution that could suggest, for instance, issues with rRNA depletion or contamination.

Figure 8: Example of the library type specific graphs for an RNA-Seq project
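The stratification behind a feature counts histogram can be sketched as follows: gene-level read counts for one sample are summed by feature type and then expressed as fractions. The gene identifiers, the feature type labels and the gene-to-type mapping are hypothetical; in practice they would come from the annotation used for quantification.

```python
from collections import defaultdict


def counts_by_feature_type(gene_counts: dict, gene_types: dict) -> dict:
    """Sum gene-level read counts by feature type for a single sample.

    Genes missing from `gene_types` are grouped under "unknown".
    """
    totals = defaultdict(int)
    for gene, n in gene_counts.items():
        totals[gene_types.get(gene, "unknown")] += n
    return dict(totals)


def fractions(totals: dict) -> dict:
    """Convert per-type totals into fractions of all counted reads."""
    total = sum(totals.values())
    return {k: v / total for k, v in totals.items()} if total else {}


# Hypothetical counts for one sample and a hypothetical annotation.
gene_counts = {"g1": 80, "g2": 15, "g3": 5}
gene_types = {"g1": "protein_coding", "g2": "rRNA"}
totals = counts_by_feature_type(gene_counts, gene_types)
shares = fractions(totals)
```

In a PolyA library, an unexpectedly large rRNA share in such a breakdown is the kind of pattern that would prompt a closer look at depletion or contamination.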

 

 

The primary QC is constantly updated as a result of the rapid pace of development at the OGC. If you have any questions regarding the QC report, please don’t hesitate to contact us; our analysts will be happy to provide more detailed information and guidance.

 

Author: Raquel Silva