Neural network finds motif determining recombination hotspots
In humans, recombination does not occur just anywhere in the genome but is concentrated in so-called recombination hotspots. An analysis identified a 13 bp motif that is enriched in hotspots, which was later shown to be the binding motif of PRDM9, a protein that drives recombination. Recently, the motif was characterized experimentally at high resolution, and a second motif was identified. We set ourselves the goal of identifying these motifs from hotspot data alone, using a neural network model. By extending a variational Bayes technique to reduce over-fitting, and by making the model respect the reverse-complement symmetry that arises from the double-stranded nature of DNA, we were able to infer the PRDM9 binding motif at very high resolution.
Image: Applying a neural network to the task of classifying recombination hotspots and coldspots identifies two motifs at high resolution. One is the classical CCNCCNTNNCCNC motif associated with PRDM9; the other is a novel motif that was recently shown to represent the binding motif of the second half of the PRDM9 zinc-finger DNA binding domain (Altemose et al., eLife 2017;6:e28383).
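To make the reverse-complement symmetry concrete, here is a minimal Python sketch that scans a sequence for the classical CCNCCNTNNCCNC motif on both strands. It is an illustration only, not the neural network model described above: the function names and the example sequence are hypothetical, and only the motif string comes from the text. Because DNA is double-stranded, a motif occurrence on the reverse strand appears on the forward strand as the reverse complement of the motif, so a strand-agnostic scan must check both.

```python
# Illustrative sketch only; this is NOT the neural network model described above.
# All function names and the example sequence are hypothetical.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence over A, C, G, T, N."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def matches_at(seq: str, motif: str, start: int) -> bool:
    """True if motif matches seq at start; 'N' in the motif matches any base."""
    window = seq[start:start + len(motif)]
    return len(window) == len(motif) and all(
        m == "N" or m == b for m, b in zip(motif, window)
    )


def find_motif_both_strands(seq: str, motif: str) -> list[tuple[int, str]]:
    """Return (start, strand) for each hit, in forward-strand coordinates."""
    seq = seq.upper()
    rc_motif = reverse_complement(motif)
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        if matches_at(seq, motif, start):
            hits.append((start, "+"))
        # A reverse-strand occurrence looks like the reverse-complemented motif.
        if matches_at(seq, rc_motif, start):
            hits.append((start, "-"))
    return hits


MOTIF = "CCNCCNTNNCCNC"                     # motif string taken from the text
forward = "AAT" + "CCACCGTAACCTC" + "GG"    # hypothetical example sequence

print(find_motif_both_strands(forward, MOTIF))
print(find_motif_both_strands(reverse_complement(forward), MOTIF))
```

Scanning a sequence and scanning its reverse complement find the same occurrence, reported on opposite strands. A model that respects this symmetry gives the same answer whichever strand it is shown.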
Inference of demography
The pattern of mutations across genomes in a population is shaped by the demographic history of that population, such as the population bottlenecks it has experienced during evolution. While this process is well understood, inferring demography back from observed mutations is complex: the relationship between the two is highly nonlinear, and single mutations carry exceedingly little information about demographic events. Using a powerful statistical inference technique termed particle filters, applied to the genomes of four diploid individuals, we were able to infer the population size over a wide range of evolutionary times.
Image: Effective population sizes across time for Central European, Chinese (from Beijing) and African (Nigerian) populations. Bottlenecks due to the Out-of-Africa event are seen in the European and Chinese, but not in the African population.
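For readers unfamiliar with particle filters, the sketch below shows the generic bootstrap variant on a toy problem: tracking a single hidden number that drifts as a random walk and is observed with Gaussian noise. This is an assumption-laden illustration of the general technique, not the demographic inference method itself, whose hidden state and likelihood are far more complex. All names, parameters, and the toy model are hypothetical.

```python
# Toy bootstrap particle filter; NOT the demographic inference model above.
# Hypothetical model: hidden x_t = x_{t-1} + N(0, step_sd^2),
#                     observed y_t = x_t + N(0, obs_sd^2).
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              step_sd=0.1, obs_sd=0.5, seed=0):
    """Return the filtered posterior-mean estimate of x_t after each observation."""
    rng = random.Random(seed)
    # Initial particles drawn from an assumed N(0, 1) prior.
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the random-walk dynamics.
        particles = [x + rng.gauss(0.0, step_sd) for x in particles]
        # 2. Weight by the Gaussian observation likelihood (in log space).
        log_w = [-0.5 * ((y - x) / obs_sd) ** 2 for x in particles]
        max_lw = max(log_w)  # subtract the max for numerical stability
        weights = [math.exp(lw - max_lw) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        # 3. Estimate the hidden state as the weighted particle mean.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # 4. Resample in proportion to weight to avoid weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Usage: fifty identical observations pull the estimate towards 2.0.
estimates = bootstrap_particle_filter([2.0] * 50)
print(round(estimates[-1], 2))
```

The propagate, weight, resample loop is what makes the approach suited to highly nonlinear problems: it needs only the ability to simulate the hidden process and to evaluate the likelihood of the data, not a closed-form posterior.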
Building on experience with an earlier tool called Platypus, we have developed Octopus, a new haplotype-based variant caller aimed at accurately detecting both germline and somatic tumor variants from short-read sequencing data. Using variational Bayes to infer haplotypes and their frequencies, it is more accurate than the current state-of-the-art tool GATK, and is able to accurately identify indel mutations. Novel features include the detection of short inversions, which are more numerous than previously thought, and the ability to phase mutations.
Image: Comparison of sensitivity and specificity on the Genome in a Bottle (GIAB) validation set.