Over the last decade, genomic research has relied on massively parallel sequencing technologies to achieve deep sequencing coverage, fast turnaround times, and high efficiency and resolution, all at a relatively low cost per base. Unlike Sanger sequencing, next-generation sequencing (NGS) platforms can sequence millions of DNA molecules simultaneously, generating massive amounts of sequence data [1]. Among these, the Illumina platforms are the most widely used for genome sequencing and have the lowest error rate per base [2].
The chemistry behind the Illumina platforms is based on “reversible terminator sequencing”. The DNA template strand is immobilized on a flow cell coated with oligos that are complementary to the adaptors ligated to the DNA strand. Each fragment is then copied by bridge amplification [3,4], forming clusters of thousands of identical DNA sequences. Clustering can occur either directly on the platform or separately on the cBot, depending on the chosen Illumina system. Each cluster is sequenced using fluorescently labelled, reversible terminator nucleotides. As each nucleotide is incorporated, fluorescence is emitted and imaged. Before the next chemistry cycle proceeds, the 3’ blocking group and the fluorophore are removed from each incorporated base to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimises incorporation bias (Figure 1). The final result is obtained through base-by-base sequencing.
As part of the routine at the OGC, all sequencing runs are quality-checked daily to ensure the best performance for each processed project. Quality control (QC) is carried out in three main steps:
Monitoring the run and initial evaluation
During the run, metrics are displayed in the Sequencing Analysis Viewer (SAV) interface (Figure 2, below). The SAV software provides a simple interface for monitoring the main run parameters during and after sequencing.
For the initial cycles, the intensity values of the four incorporated bases can be assessed to determine whether a clustering failure has occurred. No intensity (or very low intensity) indicates a failure, which can be due to issues with the library itself or to reagent and technical problems (e.g. fluidic blockages or camera issues).
The next metric to assess is the Cluster Density (K/mm²), which gives the number of clusters on each tile in thousands per mm². This value becomes available in the SAV after cycle 5, 7 or 20 for the HiSeq 2500, MiSeq (v3 kit) and MiSeq (v2 kit) platforms, respectively. Successful results depend on correctly predicting how many clusters will be generated, and the accuracy of that prediction depends on the library type and the quantification method. Overloading the library may cause clusters to merge, leading to increased density, a lower % PF (the percentage of clusters passing filter) and a reduced quality score (Q30). Conversely, underloading the library may result in low cluster density and insufficient data yield (Figure 3.1 and 3.2, below). Please note that this metric is not meaningful for the HiSeq 4000 platform: this system uses an ordered flow cell, so the Cluster Density (K/mm²) is the same for all lanes.
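To make the relationship between these two quantities concrete, the following minimal Python sketch computes a cluster density in K/mm² and a % PF for a single tile. It is an illustration only: the function name `cluster_metrics`, the cluster counts and the tile area of 1 mm² are hypothetical values chosen for the example, and in practice these metrics are read from the SAV rather than calculated by hand.

```python
def cluster_metrics(raw_clusters, pf_clusters, tile_area_mm2):
    """Return (cluster density in K/mm^2, % PF) for one tile.

    All inputs are hypothetical example values, not SAV output.
    """
    if tile_area_mm2 <= 0:
        raise ValueError("tile area must be positive")
    if not 0 <= pf_clusters <= raw_clusters:
        raise ValueError("PF clusters must lie between 0 and the raw cluster count")
    # Density is reported in thousands of clusters per square millimetre.
    density_k_per_mm2 = raw_clusters / tile_area_mm2 / 1000.0
    # % PF is the share of raw clusters that passed the filter.
    percent_pf = 100.0 * pf_clusters / raw_clusters if raw_clusters else 0.0
    return density_k_per_mm2, percent_pf


# Illustrative numbers only: 800,000 raw clusters on a 1 mm^2 tile,
# of which 720,000 pass the filter.
density, percent_pf = cluster_metrics(800_000, 720_000, tile_area_mm2=1.0)
print(f"Cluster density: {density:.0f} K/mm2, PF: {percent_pf:.1f}%")
```

An overloaded tile would show up here as a higher density combined with a lower % PF, which is the pattern described above.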
After cycle 25 on all systems, some further important metrics for Read 1 become available for quality control in the “summary” section of the SAV:
It is important to monitor the % Q-score >= Q30 and Error rate throughout the run to determine that the sequencing is proceeding as expected.
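For readers less familiar with Q-scores, the sketch below shows the standard Phred relationship, Q = -10 · log10(p), where p is the estimated probability that a base call is wrong, together with a simple % >= Q30 calculation. The helper names and the list of scores are hypothetical and serve only to illustrate the arithmetic; they are not taken from the SAV.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


def percent_at_least(q_scores, threshold=30):
    """Percentage of base calls whose Q-score is >= threshold."""
    if not q_scores:
        return 0.0
    passing = sum(1 for q in q_scores if q >= threshold)
    return 100.0 * passing / len(q_scores)


# Hypothetical Q-scores for eight base calls.
scores = [38, 37, 36, 30, 22, 12, 35, 34]
pct_q30 = percent_at_least(scores)

print(f"Q30 error probability: {phred_to_error_prob(30):.4f}")
print(f"% >= Q30: {pct_q30:.1f}")
```

A Q-score of 30 therefore corresponds to an expected error rate of 1 in 1,000 base calls, and a Q-score of 20 to 1 in 100.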
Preliminary assessment of the pool quality
Most runs use a library that contains a number of different samples pooled together. This approach is very common because specific combinations of oligos (or tags) can be used to uniquely identify each sample. The samples are generally pooled equimolarly for each tag within a pool. During the run (after the index reads are completed), a report provides the counts of the index tags following “de-multiplexing”, which our bioinformatics pipeline performs on a single tile. Loss of data yield is expected to vary from 3% to 10%, depending on library type and platform. However, the loss can be higher if the indexing strategy is suboptimal, if contamination with other libraries has occurred, and so on. The report also provides a bar chart for visually assessing the success (balance) of pooling.
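The kind of balance check described above can be sketched in a few lines of Python. This is not the pipeline's actual code: the function `index_balance`, the tag sequences and the read counts are all hypothetical, and "lost" reads are assumed here to be those that could not be assigned to any expected tag.

```python
def index_balance(counts, unassigned=0):
    """Summarise pooling balance from per-tag read counts.

    counts     -- dict mapping index tag -> number of reads (hypothetical data)
    unassigned -- reads that matched no expected tag (assumed definition of loss)
    Returns (per-tag report, percentage of reads lost).
    """
    assigned = sum(counts.values())
    total = assigned + unassigned
    if not counts or assigned == 0:
        raise ValueError("at least one tag with reads is required")
    # With equimolar pooling, every tag is expected to receive an equal share.
    expected = assigned / len(counts)
    report = {
        tag: {
            "percent_of_reads": 100.0 * n / total,
            "ratio_to_expected": n / expected,
        }
        for tag, n in counts.items()
    }
    percent_lost = 100.0 * unassigned / total
    return report, percent_lost


# Hypothetical counts for a four-sample pool on a single tile.
tag_counts = {"ATCACG": 240_000, "CGATGT": 260_000, "TTAGGC": 250_000, "TGACCA": 200_000}
report, percent_lost = index_balance(tag_counts, unassigned=50_000)

for tag, stats in report.items():
    print(tag, f"{stats['percent_of_reads']:.1f}%", f"x{stats['ratio_to_expected']:.2f}")
print(f"Reads lost at de-multiplexing: {percent_lost:.1f}%")
```

A `ratio_to_expected` well below 1 flags an under-represented sample, which is the same information the bar chart conveys visually.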
Main QC summary
A pre-installed program, the Real-Time Analysis (RTA) software, performs the primary analysis on all the Illumina platforms during the sequencing run. The RTA collects the temporary image files from the machine at every cycle and extracts the intensities by detecting the signal of the generated clusters. The intensity files are subsequently converted to base call files (.bcl). During the secondary analysis, the base call files are uploaded to the servers located at the WTCHG and de-multiplexed into “fastq” files. The fastq files contain the sequence of each read and a Q-score for each base in every read. Finally, the reads in the fastq files are mapped to the reference genome, producing alignment files (BAM and bai) from which the main QC summary statistics are generated. As standard, either Stampy in combination with BWA [7], or BWA mem, is used to map reads. The main QC file is a compendium of explanatory graphs and tables covering all mapped and unmapped reads, which are fundamental to evaluating the overall quality of a run. The graphs and tables report values for each pool belonging to a specific lane of the sequenced flow cell.
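To illustrate what a fastq file holds, the short Python sketch below reads four-line fastq records and decodes the per-base Q-scores. It assumes the Phred+33 quality encoding (each quality character's ASCII code minus 33), which is an assumption about the files rather than something stated here; the function `read_fastq` and the example record are hypothetical and unrelated to the production pipeline.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (read name, sequence, list of Q-scores) from a fastq stream.

    Assumes simple four-line records and Phred+33 encoding by default.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed fastq record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character encodes one Q-score as ASCII code minus offset.
        q_scores = [ord(char) - offset for char in quality]
        yield header[1:], sequence, q_scores


# A single made-up record: four high-quality bases and one low-quality N.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))

for name, sequence, q_scores in records:
    print(name, sequence, q_scores)
```

The decoded Q-scores are what the per-cycle quality plots in the main QC summary are built from.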
A table of the most important QC metrics and their meaning can be found below:
Most important QC metrics and their meanings

| QC metric | Meaning |
|---|---|
| G+C histogram | Describes the GC distribution of the reads (the graph will depend on the genome and library type being sequenced). The dotted line refers to all reads that have passed chastity filters, whereas the solid line refers to mapped reads only. |
| Insert size histogram | The insert size distribution is summarized by the median and median absolute deviation. Note that this is for mapped reads and may therefore give questionable results if the % mapping is low. |
| Genomic coverage by G+C | The G+C fraction is computed from the reference genome over the approximate fragment regions, with coverage in the top 0.1 percentile being excluded. The dotted line shows the G+C histogram of the reference. Comparing the two is useful for inspecting GC skewing of the library. |
| Fraction N/lowQ, read 1 (read 2) | Shows the fraction of N bases against the cycle number. Multiply the peak height by 100 to obtain the percentage of Ns at any given cycle. Please note that the scale of the X-axis varies substantially on this graph. Spikes of N bases below 2% are common and should be ignored, and occasional spikes of 25-50% are expected. |
| G+C by cycle (PF), read 1 (read 2) | This graph can be used to see whether there are any unexpected complexity issues or large-scale GC skews at specific cycles. |
| Mean Q by cycle, read 1 (read 2) | The solid line is the numerical mean Q score, a measure of the average information content per read. It is calculated using mapped reads only, whereas the dashed line in this plot uses all reads. This graph is useful for assessing the quality of sequencing throughout the read. |
| Q score histogram, read 1 (read 2) | Describes the dotted line information in the Mean Q by cycle graph as a histogram. |
| Yield (Mrd) | Describes the yield in Mrd (million reads) for all the libraries in the pool. |
| Mean G+C | Shows the overall GC% for each of the libraries. The colour blue indicates the value for read 1 and blue indicates read 2. |

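
As a brief illustration of the summary used for the insert size histogram, the Python sketch below computes the median and the median absolute deviation (MAD) of a list of insert sizes. The function `median_and_mad` and the insert sizes are hypothetical, and the QC pipeline may calculate these values differently (for example, with a scaling factor applied to the MAD).

```python
from statistics import median


def median_and_mad(values):
    """Return (median, median absolute deviation) of a sequence of numbers."""
    if not values:
        raise ValueError("at least one value is required")
    centre = median(values)
    # MAD is the median of the absolute distances from the median.
    mad = median(abs(value - centre) for value in values)
    return centre, mad


# Made-up insert sizes in base pairs, including one extreme outlier.
insert_sizes = [310, 295, 302, 350, 298, 305, 1200]
centre, mad = median_and_mad(insert_sizes)
print(f"Median insert size: {centre} bp, MAD: {mad} bp")
```

Both statistics are barely affected by the single 1200 bp outlier, which is why they suit insert size distributions better than the mean and standard deviation.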
Authors: Maria Lopopolo and Lorne Lonie