We are interested in understanding how genomes evolve through mutations and evolutionary pressures, and in using this understanding to inform research ranging from disease to human ancestry.
We try to understand non-coding functional DNA by analyzing functional genomic data through deep neural networks, to classify tumors using genome-wide signatures obtained from sequencing tumor samples, to describe past demographic events such as population splits, bottlenecks and migrations by analyzing modern and ancient whole-genome sequences, and to understand the three-dimensional structure of DNA that facilitates interactions between features far apart on the linear genome. We also develop software tools, such as the variant caller Octopus, which accurately calls and phases a range of variant types, and we develop methods for decoding Oxford Nanopore long-read sequencing data.
We draw on a range of techniques, mostly from machine learning. Examples of the tools we use include Bayesian inference to deal with uncertainty in the data, neural networks to learn from complex data in an unsupervised way, particle filters to do inference on complex models, latent Dirichlet allocation to model the hidden structure of data, and computational techniques such as the Burrows-Wheeler transform, which combines huge compression factors with fast lookups and so enables the analysis of large data sets.
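As an illustration of the last point, the following minimal Python sketch shows how the Burrows-Wheeler transform supports fast pattern lookups through backward search. It is a toy under stated assumptions: the function names `bwt` and `count_occurrences` are hypothetical, the transform is built by sorting all rotations (far too slow for real genomes), and occurrence counts are recomputed on the fly rather than stored in the sampled tables a production index would use.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform of text, built naively from sorted rotations."""
    assert "$" not in text, "'$' is reserved as the end-of-text sentinel"
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text via backward search."""
    # first_row[c] = number of characters in the text that sort before c.
    first_row = {}
    total = 0
    for c in sorted(set(bwt_text)):
        first_row[c] = total
        total += bwt_text.count(c)

    def occ(c: str, i: int) -> int:
        # Number of copies of c in bwt_text[:i] (a real index would tabulate this).
        return bwt_text.count(c, 0, i)

    lo, hi = 0, len(bwt_text)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("GATTACAGATTACA")
    print(index, count_occurrences(index, "GATTACA"))
```

The search touches the pattern one character at a time, so its cost depends on the pattern length rather than on the size of the indexed text. The transform also tends to group identical characters into runs, which is what makes it compressible.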
Some projects are described in more detail below.
Gerton Lunter has a PhD in maths, and has worked in bioinformatics since 2002. He has contributed to the 1000 Genomes Project and various primate sequencing projects. In 2014 he co-founded Genomics plc, which analyses large-scale genomic data sets to speed up drug discovery. He divides his time between the company and his research group.
Neural network finds motif determining recombination hotspots
In humans, recombination does not happen just anywhere in the genome but is concentrated in so-called recombination hotspots. An earlier analysis identified a 13 bp motif that is enriched in hotspots, which was later shown to be the binding motif of the protein PRDM9 that drives recombination. Recently, the motif was characterized experimentally to high resolution, and a second motif was identified. We set ourselves the goal of identifying these motifs from hotspot data only, using a neural network model. By extending a variational Bayes technique to reduce over-fitting, and by making the model respect the reverse-complement symmetry that follows from the double-stranded nature of DNA, we were able to infer the PRDM9 binding motif to very high resolution.
Image: Applying a neural network to the task of classifying recombination hotspots and coldspots identifies two motifs to high resolution. One is the classical CCNCCNTNNCCNC motif associated with PRDM9; the other is a novel motif that was recently shown to represent the binding motif of the second half of the PRDM9 zinc-finger DNA-binding domain (Altemose et al., eLife 2017;6:e28383).
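The reverse-complement symmetry mentioned above can be illustrated with a small Python sketch. This is not the neural network model itself. It only shows the constraint: a sequence and its reverse complement describe the same double-stranded molecule, so any score should be identical for both. The helper names `motif_hits` and `symmetric_hits` are hypothetical, and the simple wildcard match (N matches any base) stands in for a learned motif.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
MOTIF = "CCNCCNTNNCCNC"  # 13 bp motif quoted in the text; N matches any base


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_hits(seq: str, motif: str = MOTIF) -> int:
    """Count positions on the given strand where the motif matches."""
    k = len(motif)
    return sum(
        all(m == "N" or m == base for m, base in zip(motif, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


def symmetric_hits(seq: str, motif: str = MOTIF) -> int:
    """Strand-symmetric score: hits on the sequence plus hits on its reverse complement."""
    return motif_hits(seq, motif) + motif_hits(reverse_complement(seq), motif)


if __name__ == "__main__":
    seq = "AAACCTCCCTAACCACTTT"  # made-up example containing one motif instance
    rc = reverse_complement(seq)
    # The single-strand score depends on which strand was sequenced...
    print(motif_hits(seq), motif_hits(rc))
    # ...whereas the symmetric score does not.
    print(symmetric_hits(seq), symmetric_hits(rc))
```

Building the symmetry into the score, rather than leaving a model to discover it from data, means that evidence from both strands is pooled for the same motif.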
Inference of demography
The pattern of mutations across genomes in a population is shaped by the demographic history of that population, such as the population bottlenecks it has experienced during evolution. While this forward process is well understood, inferring demography from observed mutations is difficult: the relationship between the two is highly nonlinear, and any single mutation carries very little information about demographic events. By applying a powerful statistical inference technique, the particle filter, to the genomes of four diploid individuals, we were able to infer the population size over a wide range of evolutionary times.
Image: Effective population sizes across time for Central European, Chinese (from Beijing) and African (Nigerian) populations. Bottlenecks due to the Out-of-Africa event are seen in the European and Chinese, but not in the African population.
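To give a flavour of how a particle filter works, here is a minimal bootstrap particle filter in Python for a deliberately simple toy model: a hidden value that drifts as a random walk and is observed with Gaussian noise. The demographic model itself is far more complex and is not reproduced here. The function name, the toy model and all parameter values are assumptions chosen for illustration only.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=2000,
                              process_sd=0.5, obs_sd=1.0, seed=0):
    """Return the filtered mean of the hidden state after each observation."""
    rng = random.Random(seed)
    # Start from a broad prior over the hidden state.
    particles = [rng.gauss(0.0, 5.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the (random-walk) process model.
        particles = [x + rng.gauss(0.0, process_sd) for x in particles]
        # 2. Weight each particle by the likelihood of the observation.
        weights = [math.exp(-0.5 * ((y - x) / obs_sd) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed; fall back to uniform weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        # 3. The weighted mean approximates the posterior mean of the state.
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # 4. Resample in proportion to the weights to focus on plausible states.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


if __name__ == "__main__":
    # Thirty noiseless observations of a hidden value fixed at 3.0.
    print(bootstrap_particle_filter([3.0] * 30)[-1])
```

The same propagate, weight and resample loop applies when the hidden state is something much richer than a single number, which is what makes the approach attractive for models whose likelihood cannot be written down in closed form.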
Building on experience with an earlier tool called Platypus, we have developed a new haplotype-based variant caller, Octopus, aimed at accurately detecting both germline and somatic tumor variants from short-read sequencing data. Using variational Bayes to infer haplotypes and their frequencies, it is more accurate than the current state-of-the-art tool GATK, and is able to accurately identify indel mutations. Novel features include the detection of short inversions, which are more numerous than previously thought, and the ability to phase mutations.
Image: Comparison of sensitivity and specificity on the Genome in a Bottle (GIAB) validation set.
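For readers unfamiliar with these metrics, the sketch below shows one way sensitivity and specificity could be computed by comparing a call set against a truth set. It is a simplified illustration and not the evaluation procedure behind the figure: variants are compared by exact match of (chromosome, position, reference allele, alternate allele), the number of confidently genotyped sites is assumed to be supplied by the caller of the function, and the name `benchmark` is hypothetical.

```python
def benchmark(calls, truth, n_confident_sites):
    """Compare called variants with a truth set over n_confident_sites sites.

    Assumes both sets are non-empty and restricted to the confident regions.
    """
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but missed
    tn = n_confident_sites - tp - fp - fn  # remaining sites, correctly not called
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


if __name__ == "__main__":
    # Made-up variants for illustration only.
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "GA")}
    calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 400, "T", "C")}
    print(benchmark(calls, truth, n_confident_sites=1000))
```

Because variants are rare relative to the number of sites, specificity is usually very close to 1, so precision is included here as a more discriminating companion metric.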