A new genome sequence analysis programme developed by researchers at the WTCHG will make sequence data usable with greater speed, accuracy and efficiency.
In a recent paper, Gerton Lunter and Martin Goodson at the Centre describe how Stampy, a read mapping programme which they have developed, can provide researchers with an enhanced tool for sequence analysis, the first step in understanding genetic data.
High-throughput DNA sequencing machines are now the workhorses for many labs, not just those working in genetics. One of the most advanced machines, Illumina, can generate vast quantities of sequence data. 'Here at the Centre we have 7 Illumina machines', says Gerton Lunter. 'We can basically sequence the human genome in a week. Our machines are in high demand, in the Centre and more widely in the UK.'
Representation of the output of a read mapping programme from Illumina reads, showing the reference genome along the top and alignment of individual reads. Coloured squares represent the base or mismatches to the reference.
When researchers put their DNA samples through an Illumina machine, thousands of short sequences ('reads') of between 50 and 100 base pairs in length are generated, and these need to be matched up to be informative. Each read comes from a random part of the genome, and attempting to fit all of these different pieces together would be like trying to solve a huge jigsaw puzzle. Instead, it is the job of specialised read mapping programmes to match sequence reads with the corresponding sequences in a reference genome.
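To make the idea of read mapping concrete, here is a minimal Python sketch. It is not how Stampy works; it is a toy exact-match mapper that assumes a tiny made-up reference sequence and an arbitrary seed length `k`. It indexes every short word (k-mer) in the reference, then uses the first k bases of a read to look up candidate positions.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return every reference position where the whole read matches exactly."""
    seed = read[:k]  # use the first k bases of the read as a lookup key
    hits = []
    for pos in index.get(seed, []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Hypothetical toy data: a real reference genome has billions of bases.
reference = "ACGTTGCATGCCGATAGCTTAGGCTAAC"
index = build_index(reference, k=4)
print(map_read("GCCGATAG", reference, index, k=4))  # prints [9]
```

Because the index is built once and then reused for every read, looking up a read is far cheaper than scanning the whole reference each time, which is the basic reason mapping to a reference is more tractable than assembling the jigsaw from scratch.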
Read mapping programmes can tolerate the odd mismatch between test and reference sequence. But where there are larger discrepancies between the two, for example naturally occurring insertions or deletions of sequence known as 'indels', read mapping software struggles.
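The difference between a single mismatch and an indel can be illustrated with a short Python sketch. This is an illustration under assumptions, not Stampy's method: the reference and reads are made up, `best_ungapped` simply slides the read along the reference counting mismatches, and `semiglobal_edit_distance` is a textbook dynamic-programming alignment that also allows gaps.

```python
def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped(read, reference):
    """Best placement without gaps, as a (mismatches, position) pair."""
    return min(
        (hamming(read, reference[p:p + len(read)]), p)
        for p in range(len(reference) - len(read) + 1)
    )


def semiglobal_edit_distance(read, reference):
    """Fewest substitutions, insertions and deletions needed to align the
    whole read somewhere in the reference (start and end are free)."""
    prev = [0] * (len(reference) + 1)  # the read may start anywhere
    for i in range(1, len(read) + 1):
        curr = [i] + [0] * len(reference)
        for j in range(1, len(reference) + 1):
            cost = read[i - 1] != reference[j - 1]
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # extra base in the read
                          curr[j - 1] + 1)     # base missing from the read
        prev = curr
    return min(prev)  # the read may end anywhere


# Hypothetical toy data.
reference = "ACGTTGCATGCCGATAGCTTAGGCTAAC"
substituted = "GCCGACAGCTTA"  # reference[9:21] with one base changed
deleted = "GCCGATTTAGGC"      # reference[9:24] with three bases removed

for name, read in [("substitution", substituted), ("deletion", deleted)]:
    mismatches, pos = best_ungapped(read, reference)
    edits = semiglobal_edit_distance(read, reference)
    print(name, "ungapped mismatches:", mismatches, "gapped edits:", edits)
```

The read with one substitution still lines up with a single mismatch. The read carrying a three-base deletion cannot be placed without many mismatches unless the aligner is allowed to open a gap, and gapped alignment is considerably more expensive to compute.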
Even more taxing for the programmes is when RNA rather than DNA has been sequenced. Researchers often use this approach as they can home in on sequences that may be of more interest to them. But there is a problem with this. A process known as splicing, in which stretches of the transcribed RNA are cut out before it is turned into protein, means that a single RNA read may derive from widely separated regions of the genome, stretching a read mapping programme's ability to find the 'home' sequence.
The inadequacy of existing read mapping software in handling these issues persuaded Dr Lunter that better software was needed. 'I started working on the problem around two and a half years ago when read mappers weren't really able to deal with these kinds of data', he says. 'I thought I'd quickly make something, but as always, it takes longer than you think.'
The tests that the researchers put Stampy through indicate that it rises to the challenges much better than other programmes do. 'Although we tested some real data, a lot of the tests are based on simulated data. With this you know exactly where it came from, you add the changes yourself, and then you ask the programmes to figure out what you did', explains Dr Lunter. They found that Stampy out-performed the existing programmes in speed, accuracy, and proportion of sequence data that it was able to use.
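The simulated-data approach Dr Lunter describes can be sketched in a few lines of Python. This is only an outline of the general idea, not the benchmark used in the paper: the random reference, the substitution rate, and the stand-in `toy_mapper` are all assumptions made for illustration. Reads are drawn from known positions, changes are added, and the mapper is scored on how many reads it places and how many of those land where they came from.

```python
import random


def simulate_reads(reference, n_reads, read_len, sub_rate, rng):
    """Draw reads from known positions and add random substitutions."""
    reads = []
    for _ in range(n_reads):
        pos = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[pos:pos + read_len])
        for i, base in enumerate(bases):
            if rng.random() < sub_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((pos, "".join(bases)))  # keep the true origin
    return reads


def toy_mapper(read, reference, max_mismatches):
    """Stand-in mapper: best ungapped placement, or None if it is too poor."""
    mismatches, pos = min(
        (sum(x != y for x, y in zip(read, reference[p:p + len(read)])), p)
        for p in range(len(reference) - len(read) + 1)
    )
    return pos if mismatches <= max_mismatches else None


def evaluate(reads, reference, max_mismatches):
    """Return (fraction of reads mapped, fraction of mapped reads correct)."""
    mapped = correct = 0
    for true_pos, read in reads:
        pos = toy_mapper(read, reference, max_mismatches)
        if pos is not None:
            mapped += 1
            correct += pos == true_pos
    return mapped / len(reads), (correct / mapped if mapped else 0.0)


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = "".join(rng.choice("ACGT") for _ in range(500))
reads = simulate_reads(reference, n_reads=50, read_len=30, sub_rate=0.05, rng=rng)
fraction_mapped, accuracy = evaluate(reads, reference, max_mismatches=2)
print("mapped:", fraction_mapped, "correct among mapped:", accuracy)
```

Because the true origin of every read is recorded, the two numbers returned correspond loosely to two of the measures mentioned above: the proportion of data a mapper can use and the accuracy of what it maps.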
'I think that this programme will be very useful', comments Dr Lunter, who envisages developing it further. 'Our role is to provide the tools that people need, so if there is sufficient interest both inside and outside the Centre to add features to this programme, then I think we should respond to that.'
Stampy is available at www.well.ox.ac.uk/project-stampy. It is free for academic and non-profit use but is not open-source.
Lunter, G. and Goodson, M. (2010) Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Research, October 27