Tracing multiple paths through the human genome

For the first time, scientists at the Wellcome Trust Centre for Human Genetics (WTCHG) have produced a method of analysing human genomes that takes variability between individuals into account, rather than mapping high-throughput sequencing (HTS) data to a single reference genome. They have demonstrated that this method can substantially improve the accuracy of genome reconstruction, especially in regions known to be highly variable.

Example of a population reference graph

The goal of the original sequencing projects for humans and a variety of model organisms was to produce a single, linear reference that could be used to order the short DNA sequence ‘reads’ produced by HTS and to identify sequence variants such as single nucleotide polymorphisms (SNPs), insertions, deletions, duplications or other large changes. However, the choice of a reference genome for any species was effectively arbitrary, and studies over the past decade or more have greatly increased understanding of human genomic diversity.

Now Gil McVean, Zamin Iqbal, Alexander Dilthey and their colleagues have come up with an alternative approach that attempts to incorporate known variation by representing the many different possible ‘routes’ along the genome as a graph, known as a population reference graph. They focused on the major histocompatibility complex (MHC) region on chromosome 6, which is among the most diverse regions in the human genome and therefore among those least well served by the linear approach.
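
To make the idea of ‘routes’ concrete, here is a minimal sketch in Python. It is an illustration only, not the team's data structure: the toy graph, its sequences and the function names are all hypothetical. The graph is modelled as an ordered list of segments, each offering one or more alternative sequences, and every combination of choices is one route.

```python
from itertools import product

# Hypothetical toy graph: an ordered list of segments. A segment with one
# entry is sequence shared by everyone; a segment with several entries is
# a point where known haplotypes differ.
TOY_GRAPH = [
    ["ACGT"],        # shared sequence
    ["A", "G"],      # a SNP: two alternative bases
    ["TTGC"],        # shared sequence
    ["", "CAG"],     # an insertion that is absent or present
    ["GGA"],         # shared sequence
]

def count_routes(graph):
    """Number of distinct routes: the product of the choices per segment."""
    total = 1
    for alternatives in graph:
        total *= len(alternatives)
    return total

def enumerate_routes(graph):
    """Spell out the sequence of every route through the graph."""
    return ["".join(choice) for choice in product(*graph)]

routes = enumerate_routes(TOY_GRAPH)
for route in routes:
    print(route)
```

Two variable sites with two alternatives each already give four routes, whereas a linear reference would record only one of them. The number of routes multiplies with every additional variable site, which is why exhaustive enumeration like this is only feasible at toy scale.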

The team chose to represent departures from sequence identity as alternative paths based on multiple sources of information. They then developed algorithms that could map HTS data to this reference structure, first producing a ‘personalised reference’ or chromotype that picks a route through the graph for an individual, and then using variant-calling software to identify regions of new variation (represented as ‘bubbles’ in the graph).
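
The two stages can be sketched in Python at toy scale. This is a simplified illustration under stated assumptions, not the published algorithm: the graph, the reads, the k-mer scoring rule and the function names are all hypothetical. Every route is scored by how many of its k-mers (length-k substrings) also occur in the reads, the best-supported route is taken as the personalised reference, and read k-mers that this route cannot explain are flagged as candidate new variation.

```python
from itertools import product

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def pick_route(graph, reads, k=4):
    """Return the route whose k-mers are best supported by the reads.

    Exhaustive search over all routes: fine for a toy graph, but an
    assumption of this sketch rather than a scalable method.
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    best_route, best_score = None, -1
    for choice in product(*graph):
        route = "".join(choice)
        score = len(kmers(route, k) & read_kmers)
        if score > best_score:
            best_route, best_score = route, score
    return best_route

def unexplained_kmers(route, reads, k=4):
    """Read k-mers absent from the chosen route: candidate new variation."""
    route_kmers = kmers(route, k)
    found = set()
    for read in reads:
        found |= kmers(read, k) - route_kmers
    return found

# Hypothetical graph (segments with alternative sequences) and reads.
graph = [["ACGT"], ["A", "G"], ["TTGC"], ["", "CAG"], ["GGA"]]
reads = ["CGTGTTG", "TTGCCAGG", "CAGGGA"]

personal = pick_route(graph, reads)
print(personal)
print(unexplained_kmers(personal, reads))
```

Here the reads support the G allele and the inserted CAG, so the chosen route carries both. A real graph has far too many routes to enumerate, so a practical method would need something like dynamic programming over the graph, followed by a proper variant caller rather than a bare k-mer difference.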

Comparing this method with standard mapping techniques, they found that there were regions where genome inference was substantially improved. ‘This is a proof of concept’, says McVean, ‘but it offers a step towards representing the totality of human variation.’ The prototype has been made available to other researchers worldwide as a contribution to the Global Alliance for Genomics and Health data and tool sharing initiative.

McVean also points out that the method provides a framework for analysing complex genomes more generally, such as the microbial genomes that Iqbal is analysing in order to identify the sources and movement of infections in hospitals.

Alexander Dilthey, Charles Cox, Zamin Iqbal, Matthew Nelson and Gil McVean. Improved genome inference in the MHC using a population reference graph. Nature Genetics 2015, doi: 10.1038/ng.3257. Published online 27 April 2015.