Nearly all our cells contain two copies of every chromosome, one from each parent. Detecting the differences between the two copies, and how they compare with the genomes of the rest of our species, is of fundamental importance to medical genetics. Methods for detecting human genetic mutations ("variants") from genome sequence data are typically based on the principle that all genomes are almost exactly the same, so sequences can be 'mapped' against an existing reference. This has proven very effective for finding simple mutations, for example where just a few letters of the DNA code have changed. However, this approach is blind to several types of variant, including complex rearrangements of DNA and larger insertions or deletions of sequence. Furthermore, there are medically important regions where the genome has such a complex history of mutation and rearrangement that mapping approaches are unreliable. One example is the Major Histocompatibility Complex, which plays key roles in the immune system, autoimmunity, tissue transplantation and susceptibility to infectious disease.
Accessing such parts of the genome requires new methods, in particular the assembly of genomes directly from sequence data. Until now, however, the sheer scale and complexity of modern genome sequencing, which can generate over 100 billion pieces of information about a single human genome, have been too daunting for 'de novo assembly' algorithms. In a paper published today in Nature Genetics, Zam Iqbal from the Wellcome Trust Centre for Human Genetics in Oxford and Mario Caccamo from The Genome Analysis Centre in Norwich, along with colleagues, have solved this problem. What's more, they have solved it not just for one genome, but for many genomes simultaneously, by developing novel mathematical structures called coloured de Bruijn graphs. Zam and Mario, respectively a mathematician and a computer scientist by training, first met while working at the European Bioinformatics Institute near Cambridge. After discovering a mutual interest in de novo assembly, and while stuck in an airport lounge, they devised an elegant solution to the computer memory problems that had beset earlier attempts. This was to be the foundation of their new de novo assembler, although it took several years to fully develop the software and associated statistical models.
The new algorithms have already started to yield biological insights. For example, joint assembly of over 150 genomes from the 1000 Genomes Project has shown that each person has, on average, over 1.4 million DNA letters in sequence that is highly divergent from anything found in the reference, of which 45,000 letters lie within genes. In other words, not only have the researchers found genetic variation that is invisible to methods based on the human reference genome, but a significant amount of it contains gene sequence and therefore has a biological function.
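To give a flavour of the idea, the toy sketch below builds a coloured de Bruijn graph from a handful of short sequences and picks out the pieces that are absent from the reference. It is an illustration only and not how Cortex itself is implemented: the function names, the sample names and the sequences are all made up for this example, and it ignores sequencing errors, the reverse DNA strand and the memory-saving techniques that make the real software practical. Each node is a short word of k DNA letters (a "k-mer"), and its "colours" record which samples contain it. Two nodes are implicitly connected when they overlap by k-1 letters.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_coloured_graph(samples, k):
    """Map each k-mer (a node) to the set of sample names (its colours)."""
    graph = defaultdict(set)
    for name, seq in samples.items():
        for kmer in kmers(seq, k):
            graph[kmer].add(name)
    return graph


def novel_kmers(graph, reference="reference"):
    """Return the k-mers that do not carry the reference colour."""
    return {
        kmer: colours
        for kmer, colours in graph.items()
        if reference not in colours
    }


# Hypothetical toy data: person_2 differs from the reference by one letter.
samples = {
    "reference": "ACGTACGGA",
    "person_1": "ACGTACGGA",
    "person_2": "ACGTTCGGA",
}

graph = build_coloured_graph(samples, k=4)
novel = novel_kmers(graph)

for kmer in sorted(novel):
    print(kmer, sorted(novel[kmer]))
```

Because every node keeps its own set of colours, one graph can hold many genomes at once while still recording which individual each piece of sequence came from.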
The method is already being used in a wide range of collaborative studies within Oxford and beyond, and the software, called Cortex, has been made freely available.
"What's exciting about this method is that it opens up new ways of looking at how genomes change, from the development of cancerous tumours to the study of drug resistance in bacteria", says Professor Gil McVean, who has supervised the project since Zam moved to work in his Oxford-based group two years ago.
"Efficient tools such as Cortex will make the difference when it comes to making sense of the large sequence datasets that are being generated to study a wide range of biological systems, to link information about genome variation to important traits. At The Genome Analysis Centre we are working with complex genomes such as bread wheat, for which trait-specific information will be key for implementation of effective breeding programmes and ultimately contribute to the UK aims for food security", says Professor Jane Rogers, Director of The Genome Analysis Centre where Mario leads the bioinformatics division.
"Before, we could assemble and analyse the genome of an individual. Cortex is a mathematically sophisticated tool that allows you to compare genomes within a whole population while still maintaining the information about the individuals in it. It's a whole new way to analyse genomes," said Paul Flicek of the EMBL-European Bioinformatics Institute.
This work was funded by the Wellcome Trust. TGAC and Mario Caccamo receive strategic funding from the BBSRC.
For more information on Cortex see the paper itself: http://dx.doi.org/10.1038/ng.1028
and the website: http://cortexassembler.sourceforge.net/.
For more information on the Wellcome Trust Centre for Human Genetics, see www.well.ox.ac.uk.
For more information on TGAC see http://www.tgac.ac.uk/.
For more information on the EBI see www.ebi.ac.uk.