Professor Gerton Lunter

Research Area: Bioinformatics & Stats (inc. Modelling and Computational Biology)
Technology Exchange: Bioinformatics, Computational biology and Statistical genetics
Scientific Themes: Genetics & Genomics and Immunology & Infectious Disease
Keywords: Population genetics, Non-coding functional genomics, Statistical modeling and Machine Learning

It is now straightforward to sequence the DNA in a person's genome, and databases that link genetic data to a range of phenotypes are becoming ever larger. What is less straightforward is to process and interpret these data. My group is interested in developing computational and statistical methods to help use these growing resources to answer questions in medical genomics and population genetics.

The questions we address range from data processing to the interpretation of genetic variants and understanding the history of our species. This range is reflected in the variety of projects we take on, which recently have included:

  • Improving the accuracy of reads from the Oxford Nanopore (ONT) single-molecule portable sequencing device;
  • Inferring demographic events such as migrations and population bottlenecks from whole-genome sequencing data;
  • Understanding the impact of non-coding mutations on disease by building sequence-to-phenotype models;
  • Charting the differentiation of B cells in response to vaccination and infection.

We draw on a range of sources for our methods, but key recurring ingredients are Bayesian statistics, machine learning, and algorithm design. We are particularly interested in applying deep learning methods such as convolutional neural networks, which show a lot of promise for the interpretation of non-coding mutations. We also use a range of more traditional statistical and machine-learning methods, such as hidden Markov models and particle filters, and design novel algorithms, for example based around the Burrows-Wheeler transform, to deal with the often very large data sets.
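To give a flavour of one ingredient mentioned above, here is a minimal sketch of the Burrows-Wheeler transform and its inverse. It uses the naive textbook construction (sorting all rotations), which is only suitable for short strings; the function names and the `$` sentinel are illustrative assumptions, and this is not the group's own implementation.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Build every cyclic rotation and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and re-sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original string is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("GATTACA")
    print(encoded, inverse_bwt(encoded))
```

Practical tools for large genomic data sets build the transform from a suffix array rather than from explicit rotations, since the naive version needs memory quadratic in the sequence length.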

Name | Department | Institution | Country
Professor Nazneen Rahman FMedSci FRCP | Division of Genetics and Epidemiology | Institute of Cancer Research | United Kingdom
Dr Tom Hart | Zoology | University of Oxford | United Kingdom
Professor Dominic Kelly | NDM | University of Oxford | United Kingdom
Professor Gil McVean FRS FMedSci | Big Data Institute | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom

Hoehn KB, Lunter G, Pybus OG. 2017. A Phylogenetic Codon Substitution Model for Antibody Lineages. Genetics, 206 (1), pp. 417-427.

Phylogenetic methods have shown promise in understanding the development of broadly neutralizing antibody lineages (bNAbs). However, the mutational process that generates these lineages, somatic hypermutation, is biased by hotspot motifs which violates important assumptions in most phylogenetic substitution models. Here, we develop a modified GY94-type substitution model that partially accounts for this context dependency while preserving independence of sites during calculation. This model shows a substantially better fit to three well-characterized bNAb lineages than the standard GY94 model. We also demonstrate how our model can be used to test hypotheses concerning the roles of different hotspot and coldspot motifs in the evolution of B-cell lineages. Further, we explore the consequences of the idea that the number of hotspot motifs, and perhaps the mutation rate in general, is expected to decay over time in individual bNAb lineages.

Fowler A, Mahamdallie S, Ruark E, Seal S, Ramsay E, Clarke M, Uddin I, Wylie H, Strydom A, Lunter G, Rahman N. 2016. Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN. Wellcome Open Res, 1 pp. 20.

Background: Targeted next generation sequencing (NGS) panels are increasingly being used in clinical genomics to increase capacity, throughput and affordability of gene testing. Identifying whole exon deletions or duplications (termed exon copy number variants, 'exon CNVs') in exon-targeted NGS panels has proved challenging, particularly for single exon CNVs.  Methods: We developed a tool for the Detection of Exon Copy Number variants (DECoN), which is optimised for analysis of exon-targeted NGS panels in the clinical setting. We evaluated DECoN performance using 96 samples with independently validated exon CNV data. We performed simulations to evaluate DECoN detection performance of single exon CNVs and to evaluate performance using different coverage levels and sample numbers. Finally, we implemented DECoN in a clinical laboratory that tests BRCA1 and BRCA2 with the TruSight Cancer Panel (TSCP). We used DECoN to analyse 1,919 samples, validating exon CNV detections by multiplex ligation-dependent probe amplification (MLPA).  Results: In the evaluation set, DECoN achieved 100% sensitivity and 99% specificity for BRCA exon CNVs, including identification of 8 single exon CNVs. DECoN also identified 14/15 exon CNVs in 8 other genes. Simulations of all possible BRCA single exon CNVs gave a mean sensitivity of 98% for deletions and 95% for duplications. DECoN performance remained excellent with different levels of coverage and sample numbers; sensitivity and specificity was >98% with the typical NGS run parameters. In the clinical pipeline, DECoN automatically analyses pools of 48 samples at a time, taking 24 minutes per pool, on average. DECoN detected 24 BRCA exon CNVs, of which 23 were confirmed by MLPA, giving a false discovery rate of 4%. Specificity was 99.7%.  Conclusions: DECoN is a fast, accurate, exon CNV detection tool readily implementable in research and clinical NGS pipelines. It has high sensitivity and specificity and acceptable false discovery rate. 
DECoN is freely available at www.icr.ac.uk/decon.
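The false discovery rate quoted in the abstract above can be reproduced from the two counts it gives (24 detected exon CNVs, 23 confirmed by MLPA). The short sketch below does only that arithmetic; the function name is a hypothetical helper, not part of DECoN.

```python
def false_discovery_rate(detected: int, confirmed: int) -> float:
    """Fraction of detections that were not confirmed by the validation assay."""
    if detected <= 0 or not 0 <= confirmed <= detected:
        raise ValueError("need 0 <= confirmed <= detected and detected > 0")
    return (detected - confirmed) / detected


if __name__ == "__main__":
    # Counts taken from the abstract: 24 detected, 23 confirmed by MLPA.
    fdr = false_discovery_rate(detected=24, confirmed=23)
    print(f"FDR = {fdr:.1%}")  # 1/24, which the abstract rounds to 4%
```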

Galson JD, Trück J, Clutterbuck EA, Fowler A, Cerundolo V, Pollard AJ, Lunter G, Kelly DF. 2016. Erratum to: B-cell repertoire dynamics after sequential hepatitis B vaccination and evidence for cross-reactive B-cell activation. Genome Med, 8 (1), pp. 81.

It has come to our attention that there was an omission in the Acknowledgements section in this article [1]. The Acknowledgements section should read: The authors are grateful to the study participants, to the doctors and nurses at the Oxford Vaccine Group for assisting with sample collection, and to the National Institute for Health Research Clinical Research Network. The authors thank Craig Waugh for help with cell sorting and the High-Throughput Genomics Group at the Wellcome Trust Centre for Human Genetics (subsidized by Wellcome Trust grant reference 090532/Z/09/Z) for the generation of sequencing data. Purified HBsAg was provided by GlaxoSmithKline Biologicals SA, and conjugated to APC by Miltenyi Biotec.

Hoehn KB, Fowler A, Lunter G, Pybus OG. 2016. The Diversity and Molecular Evolution of B-Cell Receptors during Infection. Mol Biol Evol, 33 (5), pp. 1147-1157.

B-cell receptors (BCRs) are membrane-bound immunoglobulins that recognize and bind foreign proteins (antigens). BCRs are formed through random somatic changes of germline DNA, creating a vast repertoire of unique sequences that enable individuals to recognize a diverse range of antigens. After encountering antigen for the first time, BCRs undergo a process of affinity maturation, whereby cycles of rapid somatic mutation and selection lead to improved antigen binding. This constitutes an accelerated evolutionary process that takes place over days or weeks. Next-generation sequencing of the gene regions that determine BCR binding has begun to reveal the diversity and dynamics of BCR repertoires in unprecedented detail. Although this new type of sequence data has the potential to revolutionize our understanding of infection dynamics, quantitative analysis is complicated by the unique biology and high diversity of BCR sequences. Models and concepts from molecular evolution and phylogenetics that have been applied successfully to rapidly evolving pathogen populations are increasingly being adopted to study BCR diversity and divergence within individuals. However, BCR dynamics may violate key assumptions of many standard evolutionary methods, as they do not descend from a single ancestor, and experience biased mutation. Here, we review the application of evolutionary models to BCR repertoires and discuss the issues we believe need be addressed for this interdisciplinary field to flourish.

Ruark E, Münz M, Clarke M, Renwick A, Ramsay E, Elliott A, Seal S, Lunter G, Rahman N. 2016. OpEx - a validated, automated pipeline optimised for clinical exome sequence analysis. Sci Rep, 6 (1), pp. 31029.

We present an easy-to-use, open-source Optimised Exome analysis tool, OpEx (http://icr.ac.uk/opex) that accurately detects small-scale variation, including indels, to clinical standards. We evaluated OpEx performance with an experimentally validated dataset (the ICR142 NGS validation series), a large 1000 exome dataset (the ICR1000 UK exome series), and a clinical proband-parent trio dataset. The performance of OpEx for high-quality base substitutions and short indels in both small and large datasets is excellent, with overall sensitivity of 95%, specificity of 97% and low false detection rate (FDR) of 3%. Depending on the individual performance requirements the OpEx output allows one to optimise the inevitable trade-offs between sensitivity and specificity. For example, in the clinical setting one could permit a higher FDR and lower specificity to maximise sensitivity. In contexts where experimental validation is not possible, minimising the FDR and improving specificity may be a preferable trade-off for slightly lower sensitivity. OpEx is simple to install and use; the whole pipeline is run from a single command. OpEx is therefore well suited to the increasing research and clinical laboratories undertaking exome sequencing, particularly those without in-house dedicated bioinformatics expertise.

Galson JD, Trück J, Clutterbuck EA, Fowler A, Cerundolo V, Pollard AJ, Lunter G, Kelly DF. 2016. B-cell repertoire dynamics after sequential hepatitis B vaccination and evidence for cross-reactive B-cell activation. Genome Med, 8 (1), pp. 68.

BACKGROUND: A diverse B-cell repertoire is essential for recognition and response to infectious and vaccine antigens. High-throughput sequencing of B-cell receptor (BCR) genes can now be used to study the B-cell repertoire at great depth and may shed more light on B-cell responses than conventional immunological methods. Here, we use high-throughput BCR sequencing to provide novel insight into B-cell dynamics following a primary course of hepatitis B vaccination. METHODS: Nine vaccine-naïve participants were administered three doses of hepatitis B vaccine (months 0, 1, and 2 or 7). High-throughput Illumina sequencing of the total BCR repertoire was combined with targeted sequencing of sorted vaccine antigen-enriched B cells to analyze the longitudinal response of both the total and vaccine-specific repertoire after each vaccine. ELISpot was used to determine vaccine-specific cell numbers following each vaccine. RESULTS: Deconvoluting the vaccine-specific from total BCR repertoire showed that vaccine-specific sequence clusters comprised <0.1 % of total sequence clusters, and had certain stereotypic features. The vaccine-specific BCR sequence clusters were expanded after each of the three vaccine doses, despite no vaccine-specific B cells being detected by ELISpot after the first vaccine dose. These vaccine-specific BCR clusters detected after the first vaccine dose had distinct properties compared to those detected after subsequent doses; they were more mutated, present at low frequency even prior to vaccination, and appeared to be derived from more mature B cells. CONCLUSIONS: These results demonstrate the high-sensitivity of our vaccine-specific BCR analysis approach and suggest an alternative view of the B-cell response to novel antigens. 
In the response to the first vaccine dose, many vaccine-specific BCR clusters appeared to largely derive from previously activated cross-reactive B cells that have low affinity for the vaccine antigen, and subsequent doses were required to yield higher affinity B cells.

Galson JD, Trück J, Fowler A, Clutterbuck EA, Münz M, Cerundolo V, Reinhard C, van der Most R, Pollard AJ, Lunter G, Kelly DF. 2015. Analysis of B Cell Repertoire Dynamics Following Hepatitis B Vaccination in Humans, and Enrichment of Vaccine-specific Antibody Sequences. EBioMedicine, 2 (12), pp. 2070-2079.

Generating a diverse B cell immunoglobulin repertoire is essential for protection against infection. The repertoire in humans can now be comprehensively measured by high-throughput sequencing. Using hepatitis B vaccination as a model, we determined how the total immunoglobulin sequence repertoire changes following antigen exposure in humans, and compared this to sequences from vaccine-specific sorted cells. Clonal sequence expansions were seen 7 days after vaccination, which correlated with vaccine-specific plasma cell numbers. These expansions caused an increase in mutation, and a decrease in diversity and complementarity-determining region 3 sequence length in the repertoire. We also saw an increase in sequence convergence between participants 14 and 21 days after vaccination, coinciding with an increase of vaccine-specific memory cells. These features allowed development of a model for in silico enrichment of vaccine-specific sequences from the total repertoire. Identifying antigen-specific sequences from total repertoire data could aid our understanding B cell driven immunity, and be used for disease diagnostics and vaccine evaluation.

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature, 526 (7571), pp. 68-74.

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Taylor JC, Martin HC, Lise S, Broxholme J, Cazier JB, Rimmer A, Kanapin A, Lunter G, Fiddy S, Allan C et al. 2015. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat Genet, 47 (7), pp. 717-726.

To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.

Galson JD, Clutterbuck EA, Trück J, Ramasamy MN, Münz M, Fowler A, Cerundolo V, Pollard AJ, Lunter G, Kelly DF. 2015. BCR repertoire sequencing: different patterns of B-cell activation after two Meningococcal vaccines. Immunol Cell Biol, 93 (10), pp. 885-895.

Next-generation sequencing was used to investigate the B-cell receptor heavy chain transcript repertoire of different B-cell subsets (naive, marginal zone (MZ), immunoglobulin M (IgM) memory and IgG memory) at baseline, and of plasma cells (PCs) 7 days following administration of serogroup ACWY meningococcal polysaccharide and protein-polysaccharide conjugate vaccines. Baseline B-cell subsets could be distinguished from each other using a small number of repertoire properties (clonality, mutation from germline and complementarity-determining region 3 (CDR3) length) that were conserved between individuals. However, analyzing the CDR3 amino-acid sequence (which is particularly important for antigen binding) of the baseline subsets showed few sequences shared between individuals. In contrast, day 7 PCs demonstrated nearly 10-fold greater sequence sharing between individuals than the baseline subsets, consistent with the PCs being induced by the vaccine antigen and sharing specificity for a more limited range of epitopes. By annotating PC sequences based on IgG subclass usage and mutation, and also comparing them with the sequences of the baseline cell subsets, we were able to identify different signatures after the polysaccharide and conjugate vaccines. PCs produced after conjugate vaccination were predominantly IgG1, and most related to IgG memory cells. In contrast, after polysaccharide vaccination, the PCs were predominantly IgG2, less mutated and were equally likely to be related to MZ, IgM memory or IgG memory cells. High-throughput B-cell repertoire sequencing thus provides a unique insight into patterns of B-cell activation not possible from more conventional measures of immunogenicity.

Staab PR, Zhu S, Metzler D, Lunter G. 2015. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31 (10), pp. 1680-1682.

MOTIVATION: Coalescent-based simulation software for genomic sequences allows the efficient in silico generation of short- and medium-sized genetic sequences. However, the simulation of genome-size datasets as produced by next-generation sequencing is currently only possible using fairly crude approximations. RESULTS: We present the sequential coalescent with recombination model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale datasets with an essentially correct linkage structure.

Galson JD, Trück J, Fowler A, Münz M, Cerundolo V, Pollard AJ, Lunter G, Kelly DF. 2015. In-Depth Assessment of Within-Individual and Inter-Individual Variation in the B Cell Receptor Repertoire. Front Immunol, 6 (OCT), pp. 531.

High-throughput sequencing of the B cell receptor (BCR) repertoire can provide rapid characterization of the B cell response in a wide variety of applications in health, after vaccination and in infectious, inflammatory and immune-driven disease, and is starting to yield clinical applications. However, the interpretation of repertoire data is compromised by a lack of studies to assess the intra and inter-individual variation in the BCR repertoire over time in healthy individuals. We applied a standardized isotype-specific BCR repertoire deep sequencing protocol to a single highly sampled participant, and then evaluated the method in 9 further participants to comprehensively describe such variation. We assessed total repertoire metrics of mutation, diversity, VJ gene usage and isotype subclass usage as well as tracking specific BCR sequence clusters. There was good assay reproducibility (both in PCR amplification and biological replicates), but we detected striking fluctuations in the repertoire over time that we hypothesize may be due to subclinical immune activation. Repertoire properties were unique for each individual, which could partly be explained by a decrease in IgG2 with age, and genetic differences at the immunoglobulin locus. There was a small repertoire of public clusters (0.5, 0.3, and 1.4% of total IgA, IgG, and IgM clusters, respectively), which was enriched for expanded clusters containing sequences with suspected specificity toward antigens that should have been historically encountered by all participants through prior immunization or infection. We thus provide baseline BCR repertoire information that can be used to inform future study design, and aid in interpretation of results from these studies. 
Furthermore, our results indicate that BCR repertoire studies could be used to track changes in the public repertoire in and between populations that might relate to population immunity against infectious diseases, and identify the characteristics of inflammatory and immunological diseases.

Münz M, Ruark E, Renwick A, Ramsay E, Clarke M, Mahamdallie S, Cloke V, Seal S, Strydom A, Lunter G, Rahman N. 2015. CSN and CAVA: variant annotation tools for rapid, robust next-generation sequencing analysis in the clinical setting. Genome Med, 7 (1), pp. 76.

BACKGROUND: Next-generation sequencing (NGS) offers unprecedented opportunities to expand clinical genomics. It also presents challenges with respect to integration with data from other sequencing methods and historical data. Provision of consistent, clinically applicable variant annotation of NGS data has proved difficult, particularly of indels, an important variant class in clinical genomics. Annotation in relation to a reference genome sequence, the DNA strand of coding transcripts and potential alternative variant representations has not been well addressed. Here we present tools that address these challenges to provide rapid, standardized, clinically appropriate annotation of NGS data in line with existing clinical standards. METHODS: We developed a clinical sequencing nomenclature (CSN), a fixed variant annotation consistent with the principles of the Human Genome Variation Society (HGVS) guidelines, optimized for automated variant annotation of NGS data. To deliver high-throughput CSN annotation we created CAVA (Clinical Annotation of VAriants), a fast, lightweight tool designed for easy incorporation into NGS pipelines. CAVA allows transcript specification, appropriately accommodates the strand of a gene transcript and flags variants with alternative annotations to facilitate clinical interpretation and comparison with other datasets. We evaluated CAVA in exome data and a clinical BRCA1/BRCA2 gene testing pipeline. RESULTS: CAVA generated CSN calls for 10,313,034 variants in the ExAC database in 13.44 hours, and annotated the ICR1000 exome series in 6.5 hours. Evaluation of 731 different indels from a single individual revealed 92 % had alternative representations in left aligned and right aligned data. Annotation of left aligned data, as performed by many annotation tools, would thus give clinically discrepant annotation for the 339 (46 %) indels in genes transcribed from the forward DNA strand. 
By contrast, CAVA provides the correct clinical annotation for all indels. CAVA also flagged the 370 indels with alternative representations of a different functional class, which may profoundly influence clinical interpretation. CAVA annotation of 50 BRCA1/BRCA2 gene mutations from a clinical pipeline gave 100 % concordance with Sanger data; only 8/25 BRCA2 mutations were correctly clinically annotated by other tools. CONCLUSIONS: CAVA is a freely available tool that provides rapid, robust, high-throughput clinical annotation of NGS data, using a standardized clinical sequencing nomenclature.
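The point about alternative indel representations in the abstract above can be illustrated with a toy example: inside a repeat, the same deletion can be written at several positions, and the smallest and largest of these correspond to the left-aligned and right-aligned forms. The sketch below enumerates them by brute force; the function name and the example sequence are hypothetical, and this is not how CAVA itself is implemented.

```python
def equivalent_deletion_starts(reference: str, start: int, length: int) -> list[int]:
    """Return every 0-based start at which deleting `length` bases
    produces the same sequence as deleting at `start`."""
    if length <= 0 or start < 0 or start + length > len(reference):
        raise ValueError("deletion must lie within the reference")
    target = reference[:start] + reference[start + length:]
    return [
        s
        for s in range(len(reference) - length + 1)
        if reference[:s] + reference[s + length:] == target
    ]


if __name__ == "__main__":
    ref = "GCACACAT"  # toy sequence containing a short dinucleotide repeat
    starts = equivalent_deletion_starts(ref, start=3, length=2)
    print("equivalent starts:", starts)
    print("left-aligned:", min(starts), "right-aligned:", max(starts))
```

Which end counts as the clinically conventional one depends on the strand of the transcript, which is why a single fixed alignment direction can give discrepant annotations.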

Galson JD, Clutterbuck EA, Trueck J, Muenz M, Fowler A, Cerundolo V, Pollard AJ, Lunter G, Kelly DF. 2014. Plasma cell antibody repertoire analysis following administration of meningococcal polysaccharide and protein-polysaccharide conjugate vaccines: evidence of distinct patterns of B cell activation. Immunology, 143, pp. 62.

Trück J, Ramasamy MN, Galson JD, Rance R, Parkhill J, Lunter G, Pollard AJ, Kelly DF. 2015. Identification of antigen-specific B cell receptor sequences using public repertoire analysis. J Immunol, 194 (1), pp. 252-261.

High-throughput sequencing allows detailed study of the BCR repertoire postimmunization, but it remains unclear to what extent the de novo identification of Ag-specific sequences from the total BCR repertoire is possible. A conjugate vaccine containing Haemophilus influenzae type b (Hib) and group C meningococcal polysaccharides, as well as tetanus toxoid (TT), was used to investigate the BCR repertoire of adult humans following immunization and to test the hypothesis that public or convergent repertoire analysis could identify Ag-specific sequences. A number of Ag-specific BCR sequences have been reported for Hib and TT, which made a vaccine containing these two Ags an ideal immunological stimulus. Analysis of identical CDR3 amino acid sequences that were shared by individuals in the postvaccine repertoire identified a number of known Hib-specific sequences but only one previously described TT sequence. The extension of this analysis to nonidentical, but highly similar, CDR3 amino acid sequences revealed a number of other TT-related sequences. The anti-Hib avidity index postvaccination strongly correlated with the relative frequency of Hib-specific sequences, indicating that the postvaccination public BCR repertoire may be related to more conventional measures of immunogenicity correlating with disease protection. Analysis of public BCR repertoire provided evidence of convergent BCR evolution in individuals exposed to the same Ags. If this finding is confirmed, the public repertoire could be used for rapid and direct identification of protective Ag-specific BCR sequences from peripheral blood.

Majithia AR, Flannick J, Shahinian P, Guo M, Bray MA, Fontanillas P, Gabriel SB, GoT2D Consortium, NHGRI JHS/FHS Allelic Spectrum Project, SIGMA T2D Consortium et al. 2014. Rare variants in PPARG with decreased activity in adipocyte differentiation are associated with increased risk of type 2 diabetes. Proc Natl Acad Sci U S A, 111 (36), pp. 13127-13132.

Peroxisome proliferator-activated receptor gamma (PPARG) is a master transcriptional regulator of adipocyte differentiation and a canonical target of antidiabetic thiazolidinedione medications. In rare families, loss-of-function (LOF) mutations in PPARG are known to cosegregate with lipodystrophy and insulin resistance; in the general population, the common P12A variant is associated with a decreased risk of type 2 diabetes (T2D). Whether and how rare variants in PPARG and defects in adipocyte differentiation influence risk of T2D in the general population remains undetermined. By sequencing PPARG in 19,752 T2D cases and controls drawn from multiple studies and ethnic groups, we identified 49 previously unidentified, nonsynonymous PPARG variants (MAF < 0.5%). Considered in aggregate (with or without computational prediction of functional consequence), these rare variants showed no association with T2D (OR = 1.35; P = 0.17). The function of the 49 variants was experimentally tested in a novel high-throughput human adipocyte differentiation assay, and nine were found to have reduced activity in the assay. Carrying any of these nine LOF variants was associated with a substantial increase in risk of T2D (OR = 7.22; P = 0.005). The combination of large-scale DNA sequencing and functional testing in the laboratory reveals that approximately 1 in 1,000 individuals carries a variant in PPARG that reduces function in a human adipocyte differentiation assay and is associated with a substantial risk of T2D.

Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, WGS500 Consortium, Wilkie AOM, McVean G, Lunter G. 2014. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet, 46 (8), pp. 912-918.

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.

Petousi N, Copley RR, Lappin TR, Haggan SE, Bento CM, Cario H, Percy MJ, WGS Consortium, Ratcliffe PJ, Robbins PA, McMullin MF. 2014. Erythrocytosis associated with a novel missense mutation in the BPGM gene. Haematologica, 99 (10), pp. e201-e204.

Rands CM, Meader S, Ponting CP, Lunter G. 2014. 8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genet, 10 (7), pp. e1004525.

Ten years on from the finishing of the human reference genome sequence, it remains unclear what fraction of the human genome confers function, where this sequence resides, and how much is shared with other mammalian species. When addressing these questions, functional sequence has often been equated with pan-mammalian conserved sequence. However, functional elements that are short-lived, including those contributing to species-specific biology, will not leave a footprint of long-lasting negative selection. Here, we address these issues by identifying and characterising sequence that has been constrained with respect to insertions and deletions for pairs of eutherian genomes over a range of divergences. Within noncoding sequence, we find increasing amounts of mutually constrained sequence as species pairs become more closely related, indicating that noncoding constrained sequence turns over rapidly. We estimate that half of present-day noncoding constrained sequence has been gained or lost in approximately the last 130 million years (half-life in units of divergence time, d1/2 = 0.25-0.31). While enriched with ENCODE biochemical annotations, much of the short-lived constrained sequences we identify are not detected by models optimized for wider pan-mammalian conservation. Constrained DNase 1 hypersensitivity sites, promoters and untranslated regions have been more evolutionarily stable than long noncoding RNA loci which have turned over especially rapidly. By contrast, protein coding sequence has been highly stable, with an estimated half-life of over a billion years (d1/2 = 2.1-5.0). From extrapolations we estimate that 8.2% (7.1-9.2%) of the human genome is presently subject to negative selection and thus is likely to be functional, while only 2.2% has maintained constraint in both human and mouse since these species diverged. 
These results reveal that the evolutionary history of the human genome has been highly dynamic, particularly for its noncoding yet biologically functional fraction.
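The half-life figures quoted above can be read through a simple exponential-decay picture. The sketch below is only an illustration under that assumption (constant turnover rate); it is not the estimation procedure used in the paper, and the function name `retained_fraction` is hypothetical.

```python
def retained_fraction(divergence, half_life):
    """Fraction of constrained sequence still constrained after a given
    divergence, assuming simple exponential decay (an assumption made
    here for illustration only)."""
    return 0.5 ** (divergence / half_life)


# Half-life ranges quoted in the abstract, in units of divergence (d1/2).
for label, half_life in [("noncoding, lower bound", 0.25),
                         ("noncoding, upper bound", 0.31),
                         ("coding, lower bound", 2.1),
                         ("coding, upper bound", 5.0)]:
    # Evaluate at an arbitrary illustrative divergence of 0.25.
    print(label, round(retained_fraction(0.25, half_life), 3))
```

By construction, the retained fraction is exactly one half when the divergence equals the half-life, which is the sense in which the noncoding and coding figures above can be compared.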

Delaneau O, Marchini J, 1000 Genomes Project Consortium. 2014. Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nat Commun, 5, pp. 3934.

A major use of the 1000 Genomes Project (1000 GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome. We then phase the sequence data 'onto' this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000 GP haplotype reference set for use by the human genetic community. Using a set of validation genotypes at SNP and bi-allelic indels we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants.

Cazier JB, Rao SR, McLean CM, Walker AK, Wright BJ, Jaeger EE, Kartsonaki C, Marsden L, Yau C, Camps C et al. 2014. Whole-genome sequencing of bladder cancers reveals somatic CDKN1A mutations and clinicopathological associations with mutation burden. Nat Commun, 5, pp. 3756.

Bladder cancers are a leading cause of death from malignancy. Molecular markers might predict disease progression and behaviour more accurately than the available prognostic factors. Here we use whole-genome sequencing to identify somatic mutations and chromosomal changes in 14 bladder cancers of different grades and stages. As well as detecting the known bladder cancer driver mutations, we report the identification of recurrent protein-inactivating mutations in CDKN1A and FAT1. The former are not mutually exclusive with TP53 mutations or MDM2 amplification, showing that CDKN1A dysfunction is not simply an alternative mechanism for p53 pathway inactivation. We find strong positive associations between higher tumour stage/grade and greater clonal diversity, the number of somatic mutations and the burden of copy number changes. In principle, the identification of sub-clones with greater diversity and/or mutation burden within early-stage or low-grade tumours could identify lesions with a high risk of invasive progression.

Lamble S, Batty E, Attar M, Buck D, Bowden R, Lunter G, Crook D, El-Fahmawi B, Piazza P. 2013. Improved workflows for high throughput library preparation using the transposome-based Nextera system. BMC Biotechnol, 13 (1), pp. 104.

BACKGROUND: The Nextera protocol, which utilises a transposome based approach to create libraries for Illumina sequencing, requires pure DNA template, an accurate assessment of input concentration and a column clean-up that limits its applicability for high-throughput sample preparation. We addressed the identified limitations to develop a robust workflow that supports both rapid and high-throughput projects also reducing reagent costs. RESULTS: We show that an initial bead-based normalisation step can remove the need for quantification and improves sample purity. A 75% cost reduction was achieved with a low-volume modified protocol which was tested over genomes with different GC content to demonstrate its robustness. Finally we developed a custom set of index tags and primers which increase the number of samples that can simultaneously be sequenced on a single lane of an Illumina instrument. CONCLUSIONS: We addressed the bottlenecks of Nextera library construction to produce a modified protocol which harnesses the full power of the Nextera kit and allows the reproducible construction of libraries on a high-throughput scale reducing the associated cost of the kit.

Heger A, Webber C, Goodson M, Ponting CP, Lunter G. 2013. GAT: a simulation framework for testing the association of genomic intervals. Bioinformatics, 29 (16), pp. 2046-2048.

MOTIVATION: A common question in genomic analysis is whether two sets of genomic intervals overlap significantly. This question arises, for example, when interpreting ChIP-Seq or RNA-Seq data in functional terms. Because genome organization is complex, answering this question is non-trivial. SUMMARY: We present Genomic Association Test (GAT), a tool for estimating the significance of overlap between multiple sets of genomic intervals. GAT implements a null model that the two sets of intervals are placed independently of one another, but allows each set's density to depend on external variables, for example, isochore structure or chromosome identity. GAT estimates statistical significance based on simulation and controls for multiple tests using the false discovery rate. AVAILABILITY: GAT's source code, documentation and tutorials are available at http://code.google.com/p/genomic-association-tester.
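The idea of estimating overlap significance by simulation can be sketched in a few lines. This is a toy illustration, not GAT's implementation: it places intervals uniformly at random on a single sequence, ignores the density conditioning on isochores or chromosomes and the false discovery rate control described above, and all function names are hypothetical.

```python
import random


def covered(intervals):
    """Set of base positions covered by half-open (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def shuffle_intervals(intervals, genome_length, rng):
    """Re-place each interval at a uniformly random start, keeping its length.
    Simplification: simulated intervals are allowed to overlap each other."""
    placed = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, genome_length - length + 1)
        placed.append((new_start, new_start + length))
    return placed


def empirical_p_value(segments, annotations, genome_length, n_sim=1000, seed=0):
    """Observed overlap in bases, and the fraction of simulations with an
    overlap at least as large (with the usual +1 correction)."""
    rng = random.Random(seed)
    annotated = covered(annotations)
    observed = len(covered(segments) & annotated)
    hits = 0
    for _ in range(n_sim):
        simulated = shuffle_intervals(segments, genome_length, rng)
        if len(covered(simulated) & annotated) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)


# Toy data: two segments that sit entirely inside two annotated regions.
segments = [(100, 150), (300, 350)]
annotations = [(90, 160), (290, 360)]
observed, p_value = empirical_p_value(segments, annotations, genome_length=10_000)
print(observed, p_value)
```

Storing covered positions in a set keeps the overlap count correct even when simulated intervals overlap each other, at the cost of memory; this is only reasonable for toy-sized inputs.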

Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, Ananda G, Howie B, Karczewski KJ, Smith KS et al. 2013. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res, 23 (5), pp. 749-761.

Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

Bull KR, Rimmer AJ, Siggs OM, Miosge LA, Roots CM, Enders A, Bertram EM, Crockford TL, Whittle B, Potter PK et al. 2013. Unlocking the bottleneck in forward genetics using whole-genome sequencing and identity by descent to isolate causative mutations. PLoS Genet, 9 (1), pp. e1003219.

Forward genetics screens with N-ethyl-N-nitrosourea (ENU) provide a powerful way to illuminate gene function and generate mouse models of human disease; however, the identification of causative mutations remains a limiting step. Current strategies depend on conventional mapping, so the propagation of affected mice requires non-lethal screens; accurate tracking of phenotypes through pedigrees is complex and uncertain; out-crossing can introduce unexpected modifiers; and Sanger sequencing of candidate genes is inefficient. Here we show how these problems can be efficiently overcome using whole-genome sequencing (WGS) to detect the ENU mutations and then identify regions that are identical by descent (IBD) in multiple affected mice. In this strategy, we use a modification of the Lander-Green algorithm to isolate causative recessive and dominant mutations, even at low coverage, on a pure strain background. Analysis of the IBD regions also allows us to calculate the ENU mutation rate (1.54 mutations per Mb) and to model future strategies for genetic screens in mice. The introduction of this approach will accelerate the discovery of causal variants, permit broader and more informative lethal screens to be used, reduce animal costs, and herald a new era for ENU mutagenesis.

Palles C, Cazier JB, Howarth KM, Domingo E, Jones AM, Broderick P, Kemp Z, Spain SL, Guarino E, Salguero I et al. 2013. Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas. Nat Genet, 45 (2), pp. 136-144.

Many individuals with multiple or large colorectal adenomas or early-onset colorectal cancer (CRC) have no detectable germline mutations in the known cancer predisposition genes. Using whole-genome sequencing, supplemented by linkage and association analysis, we identified specific heterozygous POLE or POLD1 germline variants in several multiple-adenoma and/or CRC cases but in no controls. The variants associated with susceptibility, POLE p.Leu424Val and POLD1 p.Ser478Asn, have high penetrance, and POLD1 mutation was also associated with endometrial cancer predisposition. The mutations map to equivalent sites in the proofreading (exonuclease) domain of DNA polymerases ɛ and δ and are predicted to cause a defect in the correction of mispaired bases inserted during DNA replication. In agreement with this prediction, the tumors from mutation carriers were microsatellite stable but tended to acquire base substitution mutations, as confirmed by yeast functional assays. Further analysis of published data showed that the recently described group of hypermutant, microsatellite-stable CRCs is likely to be caused by somatic POLE mutations affecting the exonuclease domain.

Mailund T, Halager AE, Westergaard M, Dutheil JY, Munch K, Andersen LN, Lunter G, Prüfer K, Scally A, Hobolth A, Schierup MH. 2012. A new isolation with migration model along complete genomes infers very different divergence processes among closely related great ape species. PLoS Genet, 8 (12), pp. e1003125.

We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousand years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.

Lise S, Clarkson Y, Perkins E, Kwasniewska A, Sadighi Akha E, Schnekenberg RP, Suminaite D, Hope J, Baker I, Gregory L et al. 2012. Recessive mutations in SPTBN2 implicate β-III spectrin in both cognitive and motor development. PLoS Genet, 8 (12), pp. e1003074.

β-III spectrin is present in the brain and is known to be important in the function of the cerebellum. Heterozygous mutations in SPTBN2, the gene encoding β-III spectrin, cause Spinocerebellar Ataxia Type 5 (SCA5), an adult-onset, slowly progressive, autosomal-dominant pure cerebellar ataxia. SCA5 is sometimes known as "Lincoln ataxia," because the largest known family is descended from relatives of the United States President Abraham Lincoln. Using targeted capture and next-generation sequencing, we identified a homozygous stop codon in SPTBN2 in a consanguineous family in which childhood developmental ataxia co-segregates with cognitive impairment. The cognitive impairment could result from mutations in a second gene, but further analysis using whole-genome sequencing combined with SNP array analysis did not reveal any evidence of other mutations. We also examined a mouse knockout of β-III spectrin in which ataxia and progressive degeneration of cerebellar Purkinje cells has been previously reported and found morphological abnormalities in neurons from prefrontal cortex and deficits in object recognition tasks, consistent with the human cognitive phenotype. These data provide the first evidence that β-III spectrin plays an important role in cortical brain development and cognition, in addition to its function in the cerebellum; and we conclude that cognitive impairment is an integral part of this novel recessive ataxic syndrome, Spectrin-associated Autosomal Recessive Cerebellar Ataxia type 1 (SPARCA1). In addition, the identification of SPARCA1 and normal heterozygous carriers of the stop codon in SPTBN2 provides insights into the mechanism of molecular dominance in SCA5 and demonstrates that the cell-specific repertoire of spectrin subunits underlies a novel group of disorders, the neuronal spectrinopathies, which includes SCA5, SPARCA1, and a form of West syndrome.

Prüfer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, Koren S, Sutton G, Kodira C, Winer R et al. 2012. The bonobo genome compared with the chimpanzee and human genomes. Nature, 486 (7404), pp. 527-531.

Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.

Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, Hobolth A, Lappalainen T, Mailund T, Marques-Bonet T et al. 2012. Insights into hominid evolution from the gorilla genome sequence. Nature, 483 (7388), pp. 169-175.

Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.

Eizirik DL, Sammeth M, Bouckenooghe T, Bottu G, Sisino G, Igoillo-Esteve M, Ortis F, Santin I, Colli ML, Barthson J et al. 2012. The human pancreatic islet transcriptome: expression of candidate genes for type 1 diabetes and the impact of pro-inflammatory cytokines. PLoS Genet, 8 (3), pp. e1002552.

Type 1 diabetes (T1D) is an autoimmune disease in which pancreatic beta cells are killed by infiltrating immune cells and by cytokines released by these cells. Signaling events occurring in the pancreatic beta cells are decisive for their survival or death in diabetes. We have used RNA sequencing (RNA-seq) to identify transcripts, including splice variants, expressed in human islets of Langerhans under control conditions or following exposure to the pro-inflammatory cytokines interleukin-1β (IL-1β) and interferon-γ (IFN-γ). Based on this unique dataset, we examined whether putative candidate genes for T1D, previously identified by GWAS, are expressed in human islets. A total of 29,776 transcripts were identified as expressed in human islets. Expression of around 20% of these transcripts was modified by pro-inflammatory cytokines, including apoptosis- and inflammation-related genes. Chemokines were among the transcripts most modified by cytokines, a finding confirmed at the protein level by ELISA. Interestingly, 35% of the genes expressed in human islets undergo alternative splicing as annotated in RefSeq, and cytokines caused substantial changes in spliced transcripts. Nova1, previously considered a brain-specific regulator of mRNA splicing, is expressed in islets and its knockdown modified splicing. 25/41 of the candidate genes for T1D are expressed in islets, and cytokines modified expression of several of these transcripts. The present study doubles the number of known genes expressed in human islets and shows that cytokines modify alternative splicing in human islet cells. Importantly, it indicates that more than half of the known T1D candidate genes are expressed in human islets. This, and the production of a large number of chemokines and cytokines by cytokine-exposed islets, reinforces the concept of a dialog between pancreatic islets and the immune system in T1D. 
This dialog is modulated by candidate genes for the disease at both the immune system and beta cell level.

MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB et al. 2012. A systematic survey of loss-of-function variants in human protein-coding genes. Science, 335 (6070), pp. 823-828.

Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.

1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), pp. 56-65.

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Westesson O, Lunter G, Paten B, Holmes I. 2012. Accurate reconstruction of insertion-deletion histories by statistical phylogenetics. PLoS One, 7 (4), pp. e34572.

The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes.

Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J et al. 2012. A fine-scale chimpanzee genetic map from population sequencing. Science, 336 (6078), pp. 193-198.

To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high-throughput sequence data from 10 Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with those of humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees, and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST et al. 2011. The variant call format and VCFtools. Bioinformatics, 27 (15), pp. 2156-2158.

SUMMARY: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. AVAILABILITY: http://vcftools.sourceforge.net
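To make the format concrete, here is a minimal sketch of reading one VCF data line in Python. It handles only the eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), and is not a substitute for VCFtools or a full parser; the function name `parse_vcf_line` and the returned dictionary layout are hypothetical choices for illustration.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line.
    Header lines (starting with '#') and per-sample genotype columns
    are deliberately not handled in this sketch."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # INFO entries without '=' are flags; store them as True.
            info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),                      # 1-based reference position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),                # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,                    # values are left as strings
    }


record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2")
print(record["chrom"], record["pos"], record["alt"], record["info"]["DP"])
```

INFO values are kept as strings because their types are declared in the file header, which this sketch does not read.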

Mailund T, Dutheil JY, Hobolth A, Lunter G, Schierup MH. 2011. Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genet, 7 (3), pp. e1001319.

Due to genetic variation in the ancestor of two populations or two species, the divergence time for DNA sequences from two populations is variable along the genome. Within genomic segments, all bases will share the same divergence (because they share a most recent common ancestor) when no recombination event has occurred to split them apart. The size of these segments of constant divergence depends on the recombination rate, but also on the speciation time, the effective population size of the ancestral population, as well as demographic effects and selection. Thus, inference of these parameters may be possible if we can decode the divergence times along a genomic alignment. Here, we present a new hidden Markov model that infers the changing divergence (coalescence) times along the genome alignment using a coalescent framework, in order to estimate the speciation time, the recombination rate, and the ancestral effective population size. The model is efficient enough to allow inference on whole-genome data sets. We first investigate the power and consistency of the model with coalescent simulations and then apply it to the whole-genome sequences of the two orangutan sub-species, Bornean (P. p. pygmaeus) and Sumatran (P. p. abelii) orangutans from the Orangutan Genome Project. We estimate the speciation time between the two sub-species to be thousand years ago and the effective population size of the ancestral orangutan species to be , consistent with recent results based on smaller data sets. We also report a negative correlation between chromosome size and ancestral effective population size, which we interpret as a signature of recombination increasing the efficacy of selection.

Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, Yang SP, Wang Z, Chinwalla AT, Minx P et al. 2011. Comparative and demographic analysis of orang-utan genomes. Nature, 469 (7331), pp. 529-533.

'Orang-utan' is derived from a Malay term meaning 'man of the forest' and aptly describes the southeast Asian great apes native to Sumatra and Borneo. The orang-utan species, Pongo abelii (Sumatran) and Pongo pygmaeus (Bornean), are the most phylogenetically distant great apes from humans, thereby providing an informative perspective on hominid evolution. Here we present a Sumatran orang-utan draft genome assembly and short read sequence data from five Sumatran and five Bornean orang-utan genomes. Our analyses reveal that, compared to other primates, the orang-utan genome has many unique features. Structural evolution of the orang-utan genome has proceeded much more slowly than other great apes, evidenced by fewer rearrangements, less segmental duplication, a lower rate of gene family turnover and surprisingly quiescent Alu repeats, which have played a major role in restructuring other primate genomes. We also describe a primate polymorphic neocentromere, found in both Pongo species, emphasizing the gradual evolution of orang-utan genome structure. Orang-utans have extremely low energy usage for a eutherian mammal, far lower than their hominid relatives. Adding their genome to the repertoire of sequenced primates illuminates new signals of positive selection in several pathways including glycolipid metabolism. From the population perspective, both Pongo species are deeply diverse; however, Sumatran individuals possess greater diversity than their Bornean counterparts, and more species-specific variation. Our estimate of Bornean/Sumatran speciation time, 400,000 years ago, is more recent than most previous studies and underscores the complexity of the orang-utan speciation process. Despite a smaller modern census population size, the Sumatran effective population size (Ne) expanded exponentially relative to the ancestral Ne after the split, while Bornean Ne declined over the same period.
Overall, the resources and analyses presented here offer new opportunities in evolutionary genomics, insights into hominid biology, and an extensive database of variation for conservation efforts.

1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), pp. 1061-1073.

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Lunter G, Goodson M. 2011. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res, 21 (6), pp. 936-939.

High-volume sequencing of DNA and RNA is now within reach of any research laboratory and is quickly becoming established as a key research tool. In many workflows, each of the short sequences ("reads") resulting from a sequencing run is first "mapped" (aligned) to a reference sequence to infer the genomic location from which the read derived, a challenging task because of the high data volumes and often large genomes. Existing read mapping software excels in either speed (e.g., BWA, Bowtie, ELAND) or sensitivity (e.g., Novoalign), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly so for short insertions and deletions (indels). Here, we present a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. This results in a higher useable sequence yield and improved accuracy compared to that of existing software.

Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. 2011. Dindel: accurate indel calls from short-read data. Genome Res, 21 (6), pp. 961-973.

Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.

Meader S, Ponting CP, Lunter G. 2010. Massive turnover of functional sequence in human and other mammalian genomes. Genome Res, 20 (10), pp. 1335-1343.

Despite the availability of dozens of animal genome sequences, two key questions remain unanswered: First, what fraction of any species' genome confers biological function, and second, are apparent differences in organismal complexity reflected in an objective measure of genomic complexity? Here, we address both questions by applying, across the mammalian phylogeny, an evolutionary model that estimates the amount of functional DNA that is shared between two species' genomes. Our main findings are, first, that as the divergence between mammalian species increases, the predicted amount of pairwise shared functional sequence drops off dramatically. We show by simulations that this is not an artifact of the method, but rather indicates that functional (and mostly noncoding) sequence is turning over at a very high rate. We estimate that between 200 and 300 Mb (∼6.5%-10%) of the human genome is under functional constraint, which includes five to eight times as many constrained noncoding bases as bases that code for protein. In contrast, in D. melanogaster we estimate only 56-66 Mb to be constrained, implying a ratio of noncoding to coding constrained bases of about 2. This suggests that, rather than genome size or protein-coding gene complement, it is the number of functional bases that might best mirror our naïve preconceptions of organismal complexity.

Satija R, Hein J, Lunter GA. 2010. Genome-wide functional element detection using pairwise statistical alignment outperforms multiple genome footprinting techniques. Bioinformatics, 26 (17), pp. 2116-2120.

MOTIVATION: Comparative genomic sequence analysis is a powerful approach for identifying putative functional elements in silico. The availability of full-genome sequences from many vertebrate species has resulted in the development of popular tools, for example the phastCons software package, that search large numbers of genomes to identify conserved elements. While phastCons can analyze many genomes simultaneously, it ignores potentially informative insertion and deletion events and relies on a fixed, precomputed multiple sequence alignment. RESULTS: We have developed a new method, GRAPeFoot, which simultaneously aligns two full genomes and annotates a set of conserved regions exhibiting reduced rates of insertion, deletion and substitution mutations. We tested GRAPeFoot using the human and mouse genomes and compared its performance to a set of phastCons predictions hosted on the UCSC genome browser. Our results demonstrate that despite the use of only two genomes, GRAPeFoot identified constrained elements at rates comparable with phastCons, which analyzed data from 28 vertebrate genomes. This study demonstrates how integrated modelling of substitutions, indels and purifying selection allows a pairwise analysis to exhibit a sensitivity similar to a heuristic analysis of many genomes. AVAILABILITY: The GRAPeFoot software and set of genome-wide functional element predictions are freely available to download online at http://www.stats.ox.ac.uk/~satija/GRAPeFoot/.

Meader S, Hillier LW, Locke D, Ponting CP, Lunter G. 2010. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res, 20 (5), pp. 675-684.

We describe a statistical and comparative-genomic approach for quantifying error rates of genome sequence assemblies. The method exploits not substitutions but the pattern of insertions and deletions (indels) in genome-scale alignments for closely related species. Using two- or three-way alignments, the approach estimates the amount of aligned sequence containing clusters of nucleotides that were wrongly inserted or deleted during sequencing or assembly. Thus, the method is well-suited to assessing fine-scale sequence quality within single assemblies, between different assemblies of a single set of reads, and between genome assemblies for different species. When applying this approach to four primate genome assemblies, we found that average gap error rates per base varied considerably, by up to sixfold. As expected, bacterial artificial chromosome (BAC) sequences contained lower, but still substantial, predicted numbers of errors, arguing for caution in regarding BACs as the epitome of genome fidelity. We then mapped short reads, at approximately 10-fold statistical coverage, from a Bornean orangutan onto the Sumatran orangutan genome assembly originally constructed from capillary reads. This resulted in a reduced gap error rate and a separation of error-prone from high-fidelity sequence. Over 5000 predicted indel errors in protein-coding sequence were corrected in a hybrid assembly. Our approach contributes a new fine-scale quality metric for assemblies that should facilitate development of improved genome sequencing and assembly strategies.

Oliver PL, Goodstadt L, Bayes JJ, Birtle Z, Roach KC, Phadnis N, Beatson SA, Lunter G, Malik HS, Ponting CP. 2009. Accelerated evolution of the Prdm9 speciation gene across diverse metazoan taxa. PLoS Genet, 5 (12), pp. e1000753.

The onset of prezygotic and postzygotic barriers to gene flow between populations is a hallmark of speciation. One of the earliest postzygotic isolating barriers to arise between incipient species is the sterility of the heterogametic sex in interspecies hybrids. Four genes that underlie hybrid sterility have been identified in animals: Odysseus, JYalpha, and Overdrive in Drosophila and Prdm9 (Meisetz) in mice. Mouse Prdm9 encodes a protein with a KRAB motif, a histone methyltransferase domain and several zinc fingers. The difference of a single zinc finger distinguishes Prdm9 alleles that cause hybrid sterility from those that do not. We find that concerted evolution and positive selection have rapidly altered the number and sequence of Prdm9 zinc fingers across 13 rodent genomes. The patterns of positive selection in Prdm9 zinc fingers imply that rapid evolution has acted on the interface between the Prdm9 protein and the DNA sequences to which it binds. Similar patterns are apparent for Prdm9 zinc fingers for diverse metazoans, including primates. Indeed, allelic variation at the DNA-binding positions of human PRDM9 zinc fingers shows significant association with decreased risk of infertility. Prdm9 thus plays a role in determining male sterility both between species (mouse) and within species (human). The recurrent episodes of positive selection acting on Prdm9 suggest that the DNA sequences to which it binds must also be evolving rapidly. Our findings do not identify the nature of the underlying DNA sequences, but argue against the proposed role of Prdm9 as an essential transcription factor in mouse meiosis. We propose a hypothetical model in which incompatibilities between Prdm9-binding specificity and satellite DNAs provide the molecular basis for Prdm9-mediated hybrid sterility. We suggest that Prdm9 should be investigated as a candidate gene in other instances of hybrid sterility in metazoans.

Ponjavic J, Oliver PL, Lunter G, Ponting CP. 2009. Genomic and transcriptional co-localization of protein-coding and long non-coding RNA pairs in the developing brain. PLoS Genet, 5 (8), pp. e1000617.

Besides protein-coding mRNAs, eukaryotic transcriptomes include many long non-protein-coding RNAs (ncRNAs) of unknown function that are transcribed away from protein-coding loci. Here, we have identified 659 intergenic long ncRNAs whose genomic sequences individually exhibit evolutionary constraint, a hallmark of functionality. Of this set, those expressed in the brain are more frequently conserved and are significantly enriched with predicted RNA secondary structures. Furthermore, brain-expressed long ncRNAs are preferentially located adjacent to protein-coding genes that are (1) also expressed in the brain and (2) involved in transcriptional regulation or in nervous system development. This led us to the hypothesis that spatiotemporal co-expression of ncRNAs and nearby protein-coding genes represents a general phenomenon, a prediction that was confirmed subsequently by in situ hybridisation in developing and adult mouse brain. We provide the full set of constrained long ncRNAs as an important experimental resource and present, for the first time, substantive and predictive criteria for prioritising long ncRNA and mRNA transcript pairs when investigating their biological functions and contributions to development and disease.

Chaix R, Somel M, Kreil DP, Khaitovich P, Lunter GA. 2008. Evolution of primate gene expression: drift and corrective sweeps? Genetics, 180 (3), pp. 1379-1389.

Changes in gene expression play an important role in species' evolution. Earlier studies uncovered evidence that the effect of mutations on expression levels within the primate order is skewed, with many small downregulations balanced by fewer but larger upregulations. In addition, brain-expressed genes appeared to show an increased rate of evolution on the branch leading to human. However, the lack of a mathematical model adequately describing the evolution of gene expression precluded the rigorous establishment of these observations. Here, we develop mathematical tools that allow us to revisit these earlier observations in a model-testing and inference framework. We introduce a model for skewed gene-expression evolution within a phylogenetic tree and use a separate model to account for biological or experimental outliers. A Bayesian Markov chain Monte Carlo inference procedure allows us to infer the phylogeny and other evolutionary parameters, while quantifying the confidence in these inferences. Our results support previous observations; in particular, we find strong evidence for a sustained positive skew in the distribution of gene-expression changes in primate evolution. We propose a "corrective sweep" scenario to explain this phenomenon.

Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J. 2008. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res, 18 (2), pp. 298-309.

Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human-mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman-Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.
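For readers unfamiliar with the baseline named in this abstract, the following is a minimal Python sketch of the classic Needleman-Wunsch global alignment. It is not the MPD algorithm from the paper; the function name and the scoring values (match, mismatch, and a linear gap penalty) are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b with a linear gap penalty.

    The scoring parameters are illustrative assumptions, not values
    taken from the paper. Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back one optimal path from the bottom-right corner
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The sketch returns a single highest-scoring alignment as a point estimate, with no measure of confidence attached, which is the limitation the abstract argues a probabilistic treatment should address.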

de Groot S, Mailund T, Lunter G, Hein J. 2008. Investigating selection on viruses: a statistical alignment approach. BMC Bioinformatics, 9 (1), pp. 304.

BACKGROUND: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and can improve in accuracy compared to using a fixed alignment. RESULTS: We run a series of simulation studies to gauge how well we do in selection estimation, especially in comparison to the use of a fixed alignment. We show that the standard practice of using a ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We subsequently run our method on HIV2 and Hepatitis B sequences. CONCLUSION: We propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Moreover, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection.

Lunter G. 2007. HMMoC--a compiler for hidden Markov models. Bioinformatics, 23 (18), pp. 2485-2487.

UNLABELLED: Hidden Markov models are widely applied within computational biology. The large data sets and complex models involved demand optimized implementations, while efficient exploration of model space requires rapid prototyping. These requirements are not met by existing solutions, and hand-coding is time-consuming and error-prone. Here, I present a compiler that takes over the mechanical process of implementing HMM algorithms, by translating high-level XML descriptions into efficient C++ implementations. The compiler is highly customizable, produces efficient and bug-free code, and includes several optimizations. AVAILABILITY: http://genserv.anat.ox.ac.uk/software.
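As an illustration of the kind of routine that is otherwise written by hand, here is a minimal Python sketch of the HMM forward algorithm. It is not HMMoC output and does not use HMMoC's XML input format; the two-state model, its probabilities, and all names below are hypothetical, chosen only to make the example runnable.

```python
# Hypothetical two-state model over DNA: an AT-rich and a GC-rich state.
STATES = ("AT", "GC")
START = {"AT": 0.5, "GC": 0.5}
TRANS = {
    "AT": {"AT": 0.9, "GC": 0.1},
    "GC": {"AT": 0.2, "GC": 0.8},
}
EMIT = {
    "AT": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def forward(obs, states=STATES, start_p=START, trans_p=TRANS, emit_p=EMIT):
    """Return P(obs) for a non-empty observation sequence.

    alpha[s] holds the joint probability of the observations seen so
    far and of being in state s at the current position.
    """
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


print(forward("GATTACA"))
```

This plain version multiplies raw probabilities and will underflow on long sequences; practical implementations rescale or work in log space, which is one of the details that makes hand-coding tedious.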

Lunter G. 2007. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics, 23 (13), pp. i289-i296.

MOTIVATION: The two mutation processes that have the largest impact on genome evolution at small scales are substitutions, and sequence insertions and deletions (indels). While the former have been studied extensively, indels have received less attention, and in particular, the problem of inferring indel rates between pairs of divergent sequence remains unsolved. Here, I describe a novel and accurate method for estimating neutral indel rates between divergent pairs of genomes. RESULTS: Simulations suggest that the new method for estimating indel rates is accurate to within 2%, at divergences corresponding to that of human and mouse. Applying the method to these species, I show that indel rates are up to twice as high as is apparent from alignments, and depend strongly on the local G + C content. These results indicate that at these evolutionary distances, the contribution of indels to sequence divergence is much larger than hitherto appreciated. In particular, the ratio of substitution to indel rates between human and mouse appears to be around gamma = 8, rather than the currently accepted value of about gamma = 14.

Ponjavic J, Ponting CP, Lunter G. 2007. Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs. Genome Res, 17 (5), pp. 556-565.

Long transcripts that do not encode protein have only rarely been the subject of experimental scrutiny. Presumably, this is owing to the current lack of evidence of their functionality, thereby leaving an impression that, instead, they represent "transcriptional noise." Here, we describe an analysis of 3122 long and full-length, noncoding RNAs ("macroRNAs") from the mouse, and compare their sequences and their promoters with orthologous sequence from human and from rat. We considered three independent signatures of purifying selection related to substitutions, sequence insertions and deletions, and splicing. We find that the evolution of the set of noncoding RNAs is not consistent with neutralist explanations. Rather, our results indicate that purifying selection has acted on the macroRNAs' promoters, primary sequence, and consensus splice site motifs. Promoters have experienced the greatest elimination of nucleotide substitutions, insertions, and deletions. The proportion of conserved sequence (4.1%-5.5%) in these macroRNAs is comparable to the density of exons within protein-coding transcripts (5.2%). These macroRNAs, taken together, thus possess the imprint of purifying selection, thereby indicating their functionality. Our findings should now provide an incentive for the experimental investigation of these macroRNAs' functions.

Lunter G. 2007. Dog as an outgroup to human and mouse. PLoS Comput Biol, 3 (4), pp. e74.

Juhas M, Crook DW, Dimopoulou ID, Lunter G, Harding RM, Ferguson DJP, Hood DW. 2007. Functional analysis of the novel Type IV secretion system involved in propagation of genomic islands. Plasmid, 57 (2), pp. 221.

Juhas M, Crook DW, Dimopoulou ID, Lunter G, Harding RM, Ferguson DJ, Hood DW. 2007. Novel type IV secretion system involved in propagation of genomic islands. J Bacteriol, 189 (3), pp. 761-771.

Type IV secretion systems (T4SSs) mediate horizontal gene transfer, thus contributing to genome plasticity, evolution of infectious pathogens, and dissemination of antibiotic resistance and other virulence traits. A gene cluster of the Haemophilus influenzae genomic island ICEHin1056 has been identified as a T4SS involved in the propagation of genomic islands. This T4SS is novel and evolutionarily distant from the previously described systems. Mutation analysis showed that inactivation of key genes of this system resulted in a loss of phenotypic traits provided by a T4SS. Seven of 10 mutants with a mutation in this T4SS did not express the type IV secretion pilus. Correspondingly, disruption of the genes resulted in up to 100,000-fold reductions in conjugation frequencies compared to those of the parent strain. Moreover, the expression of this T4SS was found to be positively regulated by one of its components, the tfc24 gene. We concluded that this gene cluster represents a novel family of T4SSs involved in propagation of genomic islands.

Ponting CP, Lunter G. 2006. Signatures of adaptive evolution within human non-coding sequence. Hum Mol Genet, 15 Spec No 2 (Review Issue 2), pp. R170-R175.

The human genome is often portrayed as consisting of three sequence types, each distinguished by their mode of evolution. Purifying selection is estimated to act on 2.5-5.0% of the genome, whereas virtually all remaining sequence is considered to have evolved neutrally and to be devoid of functionality. The third mode of evolution, positive selection of advantageous changes, is considered rare. Such instances have been inferred only for a handful of sites, and these lie almost exclusively within protein-coding genes. Nevertheless, the majority of positively selected sequence is expected to lie within the wealth of functional 'dark matter' present outside of the coding sequence. Here, we review the evolutionary evidence for the majority of human-conserved DNA lying outside of the protein-coding sequence. We argue that within this non-coding fraction lies at least 1 Mb of functional sequence that has accumulated many beneficial nucleotide replacements. Illuminating the functions of this adaptive dark matter will lead to a better understanding of the sequence changes that have shaped the innovative biology of our species.

Ponting CP, Lunter G. 2006. Evolutionary biology: human brain gene wins genome race. Nature, 443 (7108), pp. 149-150.

The differences in brain size and function that separate humans from other mammals must be reflected in our genomes. It seems that the non-coding 'dark matter' of genomes harbours most of these vital changes.

Lunter G, Ponting CP, Hein J. 2006. Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput Biol, 2 (1), pp. e5.

It has become clear that a large proportion of functional DNA in the human genome does not code for protein. Identification of this non-coding functional sequence using comparative approaches is proving difficult and has previously been thought to require deep sequencing of multiple vertebrates. Here we introduce a new model and comparative method that, instead of nucleotide substitutions, uses the evolutionary imprint of insertions and deletions (indels) to infer the past consequences of selection. The model predicts the distribution of indels under neutrality, and shows an excellent fit to human-mouse ancestral repeat data. Across the genome, many unusually long ungapped regions are detected that are unaccounted for by the neutral model, and which we predict to be highly enriched in functional DNA that has been subject to purifying selection with respect to indels. We use the model to determine the proportion under indel-purifying selection to be between 2.56% and 3.25% of human euchromatin. Since annotated protein-coding genes comprise only 1.2% of euchromatin, these results lend further weight to the proposition that more than half the functional complement of the human genome is non-protein-coding. The method is surprisingly powerful at identifying selected sequence using only two or three mammalian genomes. Applying the method to the human, mouse, and dog genomes, we identify 90 Mb of human sequence under indel-purifying selection, at a predicted 10% false-discovery rate and 75% sensitivity. As expected, most of the identified sequence represents unannotated material, while the recovered proportions of known protein-coding and microRNA genes closely match the predicted sensitivity of the method. The method's high sensitivity to functional sequence such as microRNAs suggests that as yet unannotated microRNA genes are enriched among the sequences identified. Furthermore, its independence of substitutions allowed us to identify sequence that has been subject to heterogeneous selection, that is, sequence subject to both positive selection with respect to substitutions and purifying selection with respect to indels. The ability to identify elements under heterogeneous selection enables, for the first time, the genome-wide investigation of positive selection on functional elements other than protein-coding genes.

Lunter G, Miklós I, Drummond A, Jensen JL, Hein J. 2005. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics, 6, pp. 83.

BACKGROUND: Two central problems in computational biology are the determination of the alignment and phylogeny of a set of biological sequences. The traditional approach to this problem is to first build a multiple alignment of these sequences, followed by a phylogenetic reconstruction step based on this multiple alignment. However, alignment and phylogenetic inference are fundamentally interdependent, and ignoring this fact leads to biased and overconfident estimations. Whether the main interest be in sequence alignment or phylogeny, a major goal of computational biology is the co-estimation of both. RESULTS: We developed a fully Bayesian Markov chain Monte Carlo method for coestimating phylogeny and sequence alignment, under the Thorne-Kishino-Felsenstein model of substitution and single nucleotide insertion-deletion (indel) events. In our earlier work, we introduced a novel and efficient algorithm, termed the "indel peeling algorithm", which includes indels as phylogenetically informative evolutionary events, and resembles Felsenstein's peeling algorithm for substitutions on a phylogenetic tree. For a fixed alignment, our extension analytically integrates out both substitution and indel events within a proper statistical model, without the need for data augmentation at internal tree nodes, allowing for efficient sampling of tree topologies and edge lengths. To additionally sample multiple alignments, we here introduce an efficient partial Metropolized independence sampler for alignments, and combine these two algorithms into a fully Bayesian co-estimation procedure for the alignment and phylogeny problem. Our approach results in estimates for the posterior distribution of evolutionary rate parameters, for the maximum a-posteriori (MAP) phylogenetic tree, and for the posterior decoding alignment. Estimates for the evolutionary tree and multiple alignment are augmented with confidence estimates for each node height and alignment column. Our results indicate that the patterns in reliability broadly correspond to structural features of the proteins, and thus provide biologically meaningful information which is not existent in the usual point-estimate of the alignment. Our methods can handle input data of moderate size (10-20 protein sequences, each 100-200 bp), which we analyzed overnight on a standard 2 GHz personal computer. CONCLUSION: Joint analysis of multiple sequence alignment, evolutionary trees and additional evolutionary parameters can now be done within a single coherent statistical framework.

Lunter G, Hein J. 2004. A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics, 20 (Suppl 1), pp. i216-i223.

MOTIVATION: It is well known that neighbouring nucleotides in DNA sequences do not mutate independently of each other. In this paper, we introduce a context-dependent substitution model and derive an algorithm to calculate the likelihood of sequences evolving under this model. We use this algorithm to estimate neighbour-dependent substitution rates, as well as rates for dinucleotide substitutions, using a Bayesian sampling procedure. The model is irreversible, giving an arrow to time, and allowing the position of the root between a pair of sequences to be inferred without using out-groups. RESULTS: We applied the model to aligned human-mouse non-coding data. Clear neighbour dependencies were observed, including 17-18-fold increased CpG to TpG/CpA rates compared with other substitutions. Root inference positioned the root halfway between the mouse and human tips, suggesting an approximately clock-like behaviour of the irreversible part of the substitution process.

Miklós I, Lunter GA, Holmes I. 2004. A "Long Indel" model for evolutionary sequence alignment. Mol Biol Evol, 21 (3), pp. 529-540.

We present a new probabilistic model of sequence evolution, allowing indels of arbitrary length, and give sequence alignment algorithms for our model. Previously implemented evolutionary models have allowed (at most) single-residue indels or have introduced artifacts such as the existence of indivisible "fragments." We compare our algorithm to these previous methods by applying it to the structural homology dataset HOMSTRAD, evaluating the accuracy of (1) alignments and (2) evolutionary time estimates. With our method, it is possible (for the first time) to integrate probabilistic sequence alignment, with reliability indicators and arbitrary gap penalties, in the same framework as phylogenetic reconstruction. Our alignment algorithm requires that we evaluate the likelihood of any specific path of mutation events in a continuous-time Markov model, with the event times integrated out. To this effect, we introduce a "trajectory likelihood" algorithm (Appendix A). We anticipate that this algorithm will be useful in more general contexts, such as Markov Chain Monte Carlo simulations.

Broer H, Hoveijn I, Lunter G, Vegter G. 2003. Bifurcations in Hamiltonian systems: Computing singularities by Gröbner bases. Preface. Bifurcations in Hamiltonian Systems, 1806, pp. V-+.

Lunter GA, Miklós I, Song YS, Hein J. 2003. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J Comput Biol, 10 (6), pp. 869-889.

We present an efficient algorithm for statistical multiple alignment based on the TKF91 model of Thorne, Kishino, and Felsenstein (1991) on an arbitrary k-leaved phylogenetic tree. The existing algorithms use a hidden Markov model approach, which requires at least O((√5)^k) states and leads to a time complexity of O(5^k L^k), where L is the geometric mean sequence length. Using a combinatorial technique reminiscent of inclusion/exclusion, we are able to sum away the states, thus improving the time complexity to O(2^k L^k) and considerably reducing memory requirements. This makes statistical multiple alignment under the TKF91 model a definite practical possibility in the case of a phylogenetic tree with a modest number of leaves.
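
To give a sense of scale for the two complexity bounds quoted in this abstract, the sketch below compares growth proportional to 5^k L^k with growth proportional to 2^k L^k. The function names and the choice L = 100 are illustrative assumptions, and constant factors are ignored, so the numbers indicate growth rates only, not actual running times.

```python
def hmm_cost(k: int, length: int) -> int:
    """Operation count proportional to 5^k * L^k (HMM-based bound)."""
    return (5 ** k) * (length ** k)


def summed_cost(k: int, length: int) -> int:
    """Operation count proportional to 2^k * L^k (improved bound)."""
    return (2 ** k) * (length ** k)


def speedup(k: int, length: int) -> float:
    """Ratio of the two bounds; algebraically equal to (5/2)^k."""
    return hmm_cost(k, length) / summed_cost(k, length)


if __name__ == "__main__":
    # L = 100 is an arbitrary illustrative sequence length.
    for k in range(2, 7):
        print(k, summed_cost(k, 100), round(speedup(k, 100), 2))
```

The ratio does not depend on L, which is why the gain grows with the number of leaves k while both bounds remain exponential in k.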

Broer HW, Hoveijn I, Lunter GA, Vegter G. 1998. Resonances in a spring-pendulum: algorithms for equivariant singularity theory. Nonlinearity, 11 (6), pp. 1569-1605.

A spring-pendulum in resonance is a time-independent Hamiltonian model system for formal reduction to one degree of freedom, where some symmetry (reversibility) is maintained. The reduction is handled by equivariant singularity theory with a distinguished parameter, yielding an integrable approximation of the Poincaré map. This makes a concise description of certain bifurcations possible. The computation of reparametrizations from normal form to the actual system is performed by Gröbner basis techniques.

Lunter G. 1994. New proofs and a generalization of inequalities of Fan, Taussky, and Todd. Journal of Mathematical Analysis and Applications, 185 (2), pp. 464-476.

Discrete Fourier analysis is used to obtain simple proofs of certain inequalities about finite number sequences determined by Fan, Taussky, and Todd [Monatsh. Math. 59 (1955), 73-90] and their converses determined by Milovanović and Milovanović [J. Math. Anal. Appl. 88 (1992), 378-387]. Using the same techniques, the inequality [formula] is proved for all real numbers 0 = b_0, b_1, …, b_n, b_(n+1) = 0, which answers a question raised by Alzer [J. Math. Anal. Appl. 161 (1991), 142-147]. Second, the method is used to obtain the eigenvalues and eigenvectors of matrices (a_ij) that are rotation-invariant, i.e., that obey (a_ij) = (a_(i+1)(j+1)). © 1994 Academic Press, Inc.
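
The link between rotation-invariant matrices and discrete Fourier analysis can be checked numerically. The sketch below is not taken from the paper; it relies on the standard fact that a matrix whose entries depend only on (j - i) mod n (a circulant matrix) has the discrete Fourier modes as eigenvectors. All function names and the example first row are illustrative assumptions.

```python
import cmath


def circulant(first_row):
    """Build the matrix with entries a[i][j] = first_row[(j - i) % n]."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def fourier_eigenpair(first_row, m):
    """Return (eigenvalue, eigenvector) for Fourier mode m."""
    n = len(first_row)
    omega = cmath.exp(2j * cmath.pi / n)  # primitive n-th root of unity
    vector = [omega ** (m * j) for j in range(n)]
    value = sum(first_row[d] * omega ** (m * d) for d in range(n))
    return value, vector


def residual(matrix, value, vector):
    """Largest component of |A v - value * v|; near zero for an eigenpair."""
    n = len(vector)
    return max(
        abs(sum(matrix[i][j] * vector[j] for j in range(n)) - value * vector[i])
        for i in range(n)
    )


if __name__ == "__main__":
    row = [2.0, -1.0, 0.0, -1.0]  # a small symmetric example
    A = circulant(row)
    for m in range(len(row)):
        value, vector = fourier_eigenpair(row, m)
        print(m, value, residual(A, value, vector))
```

Each eigenvalue is the discrete Fourier transform of the first row evaluated at mode m, so the whole spectrum is obtained without solving a characteristic polynomial.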

Westesson O, Lunter G, Paten B, Holmes I. Phylogenetic automata, pruning, and multiple alignment.

We present an extension of Felsenstein's algorithm to indel models defined on entire sequences, without the need to condition on one multiple alignment. The algorithm makes use of a generalization from probabilistic substitution matrices to weighted finite-state transducers. Our approach may equivalently be viewed as a probabilistic formulation of progressive multiple sequence alignment, using partial-order graphs to represent ensemble profiles of ancestral sequences. We present a hierarchical stochastic approximation technique which makes this algorithm tractable for alignment analyses of reasonable size.

Hoehn KB, Lunter G, Pybus OG. 2017. A Phylogenetic Codon Substitution Model for Antibody Lineages. Genetics, 206 (1), pp. 417-427.

Phylogenetic methods have shown promise in understanding the development of broadly neutralizing antibody lineages (bNAbs). However, the mutational process that generates these lineages, somatic hypermutation, is biased by hotspot motifs, which violates important assumptions in most phylogenetic substitution models. Here, we develop a modified GY94-type substitution model that partially accounts for this context dependency while preserving independence of sites during calculation. This model shows a substantially better fit to three well-characterized bNAb lineages than the standard GY94 model. We also demonstrate how our model can be used to test hypotheses concerning the roles of different hotspot and coldspot motifs in the evolution of B-cell lineages. Further, we explore the consequences of the idea that the number of hotspot motifs, and perhaps the mutation rate in general, is expected to decay over time in individual bNAb lineages.

Hoehn KB, Fowler A, Lunter G, Pybus OG. 2016. The Diversity and Molecular Evolution of B-Cell Receptors during Infection. Mol Biol Evol, 33 (5), pp. 1147-1157.

B-cell receptors (BCRs) are membrane-bound immunoglobulins that recognize and bind foreign proteins (antigens). BCRs are formed through random somatic changes of germline DNA, creating a vast repertoire of unique sequences that enable individuals to recognize a diverse range of antigens. After encountering antigen for the first time, BCRs undergo a process of affinity maturation, whereby cycles of rapid somatic mutation and selection lead to improved antigen binding. This constitutes an accelerated evolutionary process that takes place over days or weeks. Next-generation sequencing of the gene regions that determine BCR binding has begun to reveal the diversity and dynamics of BCR repertoires in unprecedented detail. Although this new type of sequence data has the potential to revolutionize our understanding of infection dynamics, quantitative analysis is complicated by the unique biology and high diversity of BCR sequences. Models and concepts from molecular evolution and phylogenetics that have been applied successfully to rapidly evolving pathogen populations are increasingly being adopted to study BCR diversity and divergence within individuals. However, BCR dynamics may violate key assumptions of many standard evolutionary methods, as they do not descend from a single ancestor, and experience biased mutation. Here, we review the application of evolutionary models to BCR repertoires and discuss the issues we believe need be addressed for this interdisciplinary field to flourish.

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature, 526 (7571), pp. 68-74.

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Staab PR, Zhu S, Metzler D, Lunter G. 2015. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31 (10), pp. 1680-1682.

MOTIVATION: Coalescent-based simulation software for genomic sequences allows the efficient in silico generation of short- and medium-sized genetic sequences. However, the simulation of genome-size datasets as produced by next-generation sequencing is currently only possible using fairly crude approximations. RESULTS: We present the sequential coalescent with recombination model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale datasets with an essentially correct linkage structure.

Trück J, Ramasamy MN, Galson JD, Rance R, Parkhill J, Lunter G, Pollard AJ, Kelly DF. 2015. Identification of antigen-specific B cell receptor sequences using public repertoire analysis. J Immunol, 194 (1), pp. 252-261.

High-throughput sequencing allows detailed study of the BCR repertoire postimmunization, but it remains unclear to what extent the de novo identification of Ag-specific sequences from the total BCR repertoire is possible. A conjugate vaccine containing Haemophilus influenzae type b (Hib) and group C meningococcal polysaccharides, as well as tetanus toxoid (TT), was used to investigate the BCR repertoire of adult humans following immunization and to test the hypothesis that public or convergent repertoire analysis could identify Ag-specific sequences. A number of Ag-specific BCR sequences have been reported for Hib and TT, which made a vaccine containing these two Ags an ideal immunological stimulus. Analysis of identical CDR3 amino acid sequences that were shared by individuals in the postvaccine repertoire identified a number of known Hib-specific sequences but only one previously described TT sequence. The extension of this analysis to nonidentical, but highly similar, CDR3 amino acid sequences revealed a number of other TT-related sequences. The anti-Hib avidity index postvaccination strongly correlated with the relative frequency of Hib-specific sequences, indicating that the postvaccination public BCR repertoire may be related to more conventional measures of immunogenicity correlating with disease protection. Analysis of public BCR repertoire provided evidence of convergent BCR evolution in individuals exposed to the same Ags. If this finding is confirmed, the public repertoire could be used for rapid and direct identification of protective Ag-specific BCR sequences from peripheral blood.

Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, WGS500 Consortium, Wilkie AOM, McVean G, Lunter G. 2014. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet, 46 (8), pp. 912-918.

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.

Rands CM, Meader S, Ponting CP, Lunter G. 2014. 8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genet, 10 (7), pp. e1004525.

Ten years on from the finishing of the human reference genome sequence, it remains unclear what fraction of the human genome confers function, where this sequence resides, and how much is shared with other mammalian species. When addressing these questions, functional sequence has often been equated with pan-mammalian conserved sequence. However, functional elements that are short-lived, including those contributing to species-specific biology, will not leave a footprint of long-lasting negative selection. Here, we address these issues by identifying and characterising sequence that has been constrained with respect to insertions and deletions for pairs of eutherian genomes over a range of divergences. Within noncoding sequence, we find increasing amounts of mutually constrained sequence as species pairs become more closely related, indicating that noncoding constrained sequence turns over rapidly. We estimate that half of present-day noncoding constrained sequence has been gained or lost in approximately the last 130 million years (half-life in units of divergence time, d1/2 = 0.25-0.31). While enriched with ENCODE biochemical annotations, much of the short-lived constrained sequences we identify are not detected by models optimized for wider pan-mammalian conservation. Constrained DNase 1 hypersensitivity sites, promoters and untranslated regions have been more evolutionarily stable than long noncoding RNA loci which have turned over especially rapidly. By contrast, protein coding sequence has been highly stable, with an estimated half-life of over a billion years (d1/2 = 2.1-5.0). From extrapolations we estimate that 8.2% (7.1-9.2%) of the human genome is presently subject to negative selection and thus is likely to be functional, while only 2.2% has maintained constraint in both human and mouse since these species diverged. 
These results reveal that the evolutionary history of the human genome has been highly dynamic, particularly for its noncoding yet biologically functional fraction.
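
The half-life figures quoted in this abstract (d1/2, in units of divergence) can be read through a simple exponential-decay calculation. The sketch below assumes that constrained sequence is lost at a constant rate per unit divergence, which is the usual meaning of a half-life but is an assumption here rather than a description of the paper's model. The function name is hypothetical.

```python
def retained_fraction(divergence: float, half_life: float) -> float:
    """Fraction of constrained sequence still shared after `divergence`,
    assuming exponential turnover with the given half-life (same units)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (divergence / half_life)


if __name__ == "__main__":
    # Half-life ranges quoted in the abstract: noncoding 0.25-0.31,
    # protein coding 2.1-5.0 (units of divergence).
    for label, half_life in [("noncoding, d1/2 = 0.25", 0.25),
                             ("noncoding, d1/2 = 0.31", 0.31),
                             ("coding, d1/2 = 2.1", 2.1),
                             ("coding, d1/2 = 5.0", 5.0)]:
        print(label, round(retained_fraction(0.5, half_life), 3))
```

Under this assumption, a species pair separated by one noncoding half-life shares half of its noncoding constrained sequence, while coding sequence at the same divergence is almost entirely retained.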

Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, Ananda G, Howie B, Karczewski KJ, Smith KS et al. 2013. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res, 23 (5), pp. 749-761.

Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

Mailund T, Dutheil JY, Hobolth A, Lunter G, Schierup MH. 2011. Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genet, 7 (3), pp. e1001319.

Due to genetic variation in the ancestor of two populations or two species, the divergence time for DNA sequences from two populations is variable along the genome. Within genomic segments all bases will share the same divergence-because they share a most recent common ancestor-when no recombination event has occurred to split them apart. The size of these segments of constant divergence depends on the recombination rate, but also on the speciation time, the effective population size of the ancestral population, as well as demographic effects and selection. Thus, inference of these parameters may be possible if we can decode the divergence times along a genomic alignment. Here, we present a new hidden Markov model that infers the changing divergence (coalescence) times along the genome alignment using a coalescent framework, in order to estimate the speciation time, the recombination rate, and the ancestral effective population size. The model is efficient enough to allow inference on whole-genome data sets. We first investigate the power and consistency of the model with coalescent simulations and then apply it to the whole-genome sequences of the two orangutan sub-species, Bornean (P. p. pygmaeus) and Sumatran (P. p. abelii) orangutans from the Orangutan Genome Project. We estimate the speciation time between the two sub-species to be thousand years ago and the effective population size of the ancestral orangutan species to be , consistent with recent results based on smaller data sets. We also report a negative correlation between chromosome size and ancestral effective population size, which we interpret as a signature of recombination increasing the efficacy of selection.

Lunter G, Goodson M. 2011. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res, 21 (6), pp. 936-939.

High-volume sequencing of DNA and RNA is now within reach of any research laboratory and is quickly becoming established as a key research tool. In many workflows, each of the short sequences ("reads") resulting from a sequencing run is first "mapped" (aligned) to a reference sequence to infer the genomic location from which the read derived, a challenging task because of the high data volumes and often large genomes. Existing read mapping software excels in either speed (e.g., BWA, Bowtie, ELAND) or sensitivity (e.g., Novoalign), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly so for short insertions and deletions (indels). Here, we present a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. This results in a higher useable sequence yield and improved accuracy compared to that of existing software.

Meader S, Ponting CP, Lunter G. 2010. Massive turnover of functional sequence in human and other mammalian genomes. Genome Res, 20 (10), pp. 1335-1343.

Despite the availability of dozens of animal genome sequences, two key questions remain unanswered: First, what fraction of any species' genome confers biological function, and second, are apparent differences in organismal complexity reflected in an objective measure of genomic complexity? Here, we address both questions by applying, across the mammalian phylogeny, an evolutionary model that estimates the amount of functional DNA that is shared between two species' genomes. Our main findings are, first, that as the divergence between mammalian species increases, the predicted amount of pairwise shared functional sequence drops off dramatically. We show by simulations that this is not an artifact of the method, but rather indicates that functional (and mostly noncoding) sequence is turning over at a very high rate. We estimate that between 200 and 300 Mb (∼6.5%-10%) of the human genome is under functional constraint, which includes five to eight times as many constrained noncoding bases as bases that code for protein. In contrast, in D. melanogaster we estimate only 56-66 Mb to be constrained, implying a ratio of noncoding to coding constrained bases of about 2. This suggests that, rather than genome size or protein-coding gene complement, it is the number of functional bases that might best mirror our naïve preconceptions of organismal complexity.

Ponjavic J, Ponting CP, Lunter G. 2007. Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs. Genome Res, 17 (5), pp. 556-565.

Long transcripts that do not encode protein have only rarely been the subject of experimental scrutiny. Presumably, this is owing to the current lack of evidence of their functionality, thereby leaving an impression that, instead, they represent "transcriptional noise." Here, we describe an analysis of 3122 long and full-length, noncoding RNAs ("macroRNAs") from the mouse, and compare their sequences and their promoters with orthologous sequence from human and from rat. We considered three independent signatures of purifying selection related to substitutions, sequence insertions and deletions, and splicing. We find that the evolution of the set of noncoding RNAs is not consistent with neutralist explanations. Rather, our results indicate that purifying selection has acted on the macroRNAs' promoters, primary sequence, and consensus splice site motifs. Promoters have experienced the greatest elimination of nucleotide substitutions, insertions, and deletions. The proportion of conserved sequence (4.1%-5.5%) in these macroRNAs is comparable to the density of exons within protein-coding transcripts (5.2%). These macroRNAs, taken together, thus possess the imprint of purifying selection, thereby indicating their functionality. Our findings should now provide an incentive for the experimental investigation of these macroRNAs' functions.
