Genetic discovery from the multitude of phenotypes extractable from routine healthcare data can transform understanding of the human phenome and accelerate progress toward precision medicine. However, a critical question when analyzing high-dimensional and heterogeneous data is how best to interrogate increasingly specific subphenotypes while retaining statistical power to detect genetic associations. Here we develop and employ a new Bayesian analysis framework that exploits the hierarchical structure of diagnosis classifications to analyze genetic variants against UK Biobank disease phenotypes derived from self-reporting and hospital episode statistics. Our method displays a more than 20% increase in power to detect genetic effects over other approaches and identifies new associations between classical human leukocyte antigen (HLA) alleles and common immune-mediated diseases (IMDs). By applying the approach to genetic risk scores (GRSs), we show the extent of genetic sharing among IMDs and expose differences in disease perception or diagnosis with potential clinical implications.
© Institute of Mathematical Statistics, 2017. The combination of genetic information with electronic patient records promises to provide a powerful new resource for understanding human disease and its treatment. Here we develop and apply a novel stochastic compartmental model to a large dataset on Clostridium difficile infection (CDI) in three Oxfordshire hospitals over a 2.5 year period which combines genetic information on 858 confirmed cases of CDI with a database of 750,000 patient records. C. difficile is a major cause of healthcare-associated diarrhoea and is responsible for substantial mortality and morbidity, with relatively little known about its biology or its transmission epidemiology. Bayesian analysis of our model, via Markov chain Monte Carlo, provides new information about the biology of CDI, including genetic heterogeneity in infectiousness across different sequence types, and evidence for ward contamination as a significant mode of transmission, and allows inferences about the contribution of particular individuals, wards or hospitals to transmission of the bacterium, and assessment of changes in these over time following changes in hospital practice. Our work demonstrates the value of using statistical modelling and computational inference on large-scale hospital patient databases and genetic data.
The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability of this disease. Here, to test the hypothesis that lower-frequency variants explain much of the remainder, the GoT2D and T2D-GENES consortia performed whole-genome sequencing in 2,657 European individuals with and without diabetes, and exome sequencing in 12,940 individuals from five ancestry groups. To increase statistical power, we expanded the sample size via genotyping and imputation in a further 111,548 subjects. Variants associated with type 2 diabetes after sequencing were overwhelmingly common and most fell within regions previously identified by genome-wide association studies. Comprehensive enumeration of sequence variation is necessary to identify functional alleles that provide important clues to disease pathophysiology, but large-scale sequencing does not support the idea that lower-frequency variants have a major role in predisposition to type 2 diabetes.
Bacteremia (bacterial bloodstream infection) is a major cause of illness and death in sub-Saharan Africa but little is known about the role of human genetics in susceptibility. We conducted a genome-wide association study of bacteremia susceptibility in more than 5,000 Kenyan children as part of the Wellcome Trust Case Control Consortium 2 (WTCCC2). Both the blood-culture-proven bacteremia case subjects and healthy infants as controls were recruited from Kilifi, on the east coast of Kenya. Streptococcus pneumoniae is the most common cause of bacteremia in Kilifi and was thus the focus of this study. We identified an association between polymorphisms in a long intergenic non-coding RNA (lincRNA) gene (AC011288.2) and pneumococcal bacteremia and replicated the results in the same population (combined p = 1.69 × 10⁻⁹; OR = 2.47, 95% CI = 1.84-3.31). The susceptibility allele is African specific, derived rather than ancestral, and occurs at low frequency (2.7% in control subjects and 6.4% in case subjects). Our further studies showed AC011288.2 expression only in neutrophils, a cell type that is known to play a major role in pneumococcal clearance. Identification of this novel association will further focus research on the role of lincRNAs in human infectious disease.
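As a rough consistency check on the figures in the abstract above, the allele frequencies it reports (2.7% in controls, 6.4% in cases) can be turned into a crude allelic odds ratio. This is only a back-of-the-envelope sketch: the function name is hypothetical, and the published OR of 2.47 presumably comes from a fitted association model, so exact agreement is not expected.

```python
def allelic_odds_ratio(case_freq: float, control_freq: float) -> float:
    """Odds of the allele in cases divided by the odds of the allele in controls."""
    case_odds = case_freq / (1.0 - case_freq)
    control_odds = control_freq / (1.0 - control_freq)
    return case_odds / control_odds


# Allele frequencies quoted in the abstract (cases 6.4%, controls 2.7%).
crude_or = allelic_odds_ratio(case_freq=0.064, control_freq=0.027)
print(f"crude allelic OR = {crude_or:.2f}")
```

The crude value falls inside the reported 95% confidence interval (1.84-3.31) and close to the reported point estimate.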
The DNA-binding protein PRDM9 directs positioning of the double-strand breaks (DSBs) that initiate meiotic recombination in mice and humans. Prdm9 is the only mammalian speciation gene yet identified and is responsible for sterility phenotypes in male hybrids of certain mouse subspecies. To investigate PRDM9 binding and its role in fertility and meiotic recombination, we humanized the DNA-binding domain of PRDM9 in C57BL/6 mice. This change repositions DSB hotspots and completely restores fertility in male hybrids. Here we show that alteration of one Prdm9 allele impacts the behaviour of DSBs controlled by the other allele at chromosome-wide scales. These effects correlate strongly with the degree to which each PRDM9 variant binds both homologues at the DSB sites it controls. Furthermore, higher genome-wide levels of such 'symmetric' PRDM9 binding associate with increasing fertility measures, and comparisons of individual hotspots suggest binding symmetry plays a downstream role in the recombination process. These findings reveal that subspecies-specific degradation of PRDM9 binding sites by meiotic drive, which steadily increases asymmetric PRDM9 binding, has impacts beyond simply changing hotspot positions, and strongly support a direct involvement in hybrid infertility. Because such meiotic drive occurs across mammals, PRDM9 may play a wider, yet transient, role in the early stages of speciation.
Genetic studies have shown that obesity risk is heritable and that, of the many common variants now associated with body mass index, those in an intron of the fat mass and obesity-associated (FTO) gene have the largest effect. The size of the UK Biobank, and its joint measurement of genetic, anthropometric and lifestyle variables, offers an unprecedented opportunity to assess gene-by-environment interactions in a way that accounts for the dependence between different factors. We jointly examine the evidence for interactions between FTO (rs1421085) and various lifestyle and environmental factors. We report interactions between the FTO variant and each of: frequency of alcohol consumption (P = 3.0 × 10⁻⁴); deviations from mean sleep duration (P = 8.0 × 10⁻⁴); overall diet (P = 5.0 × 10⁻⁶), including added salt (P = 1.2 × 10⁻³); and physical activity (P = 3.1 × 10⁻⁴).
We performed fine mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in or near KCNQ1. 'Credible sets' of the variants most likely to drive each distinct signal mapped predominantly to noncoding sequence, implying that association with T2D is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine mapping implicated rs10830963 as driving T2D association. We confirmed that the T2D risk allele for this SNP increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease.
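The 'credible sets' mentioned in the abstract above can be illustrated with one common construction; the abstract does not say which construction or coverage level the authors used, so everything below is an assumption. The sketch assumes a single causal variant per signal and equal prior probability for every variant, so each variant's posterior probability is proportional to its Bayes factor. The variant names and log Bayes factors are hypothetical.

```python
import math


def credible_set(log_bfs: dict[str, float], coverage: float = 0.99) -> list[str]:
    """Smallest set of variants whose posterior probabilities sum to >= coverage.

    Assumes one causal variant per signal and equal priors across variants,
    so posterior probability is proportional to the Bayes factor.
    """
    # Subtract the maximum before exponentiating to avoid overflow.
    max_log = max(log_bfs.values())
    weights = {v: math.exp(lb - max_log) for v, lb in log_bfs.items()}
    total = sum(weights.values())
    posteriors = {v: w / total for v, w in weights.items()}

    # Add variants in decreasing order of posterior until coverage is reached.
    ranked = sorted(posteriors, key=posteriors.get, reverse=True)
    chosen: list[str] = []
    cumulative = 0.0
    for variant in ranked:
        chosen.append(variant)
        cumulative += posteriors[variant]
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical natural-log Bayes factors for four variants at one signal.
example = {"snp_a": 12.0, "snp_b": 10.0, "snp_c": 5.0, "snp_d": 1.0}
print(credible_set(example))
```

A smaller credible set means the association signal has been localized to fewer candidate variants.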
The last few decades have utterly transformed genetics and genomics, but what might the next ten years bring? PLOS Biology asked eight leaders spanning a range of related areas to give us their predictions. Without exception, the predictions are for more data on a massive scale and of more diverse types. All are optimistic and predict enormous positive impact on scientific understanding, while a recurring theme is the benefit of such data for the transformation and personalization of medicine. Several also point out that the biggest changes will very likely be those that we don't foresee, even now.
To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.
Accurate prediction of the functional effect of genetic variation is critical for clinical genome interpretation. We systematically characterized the transcriptome effects of protein-truncating variants, a class of variants expected to have profound effects on gene function, using data from the Genotype-Tissue Expression (GTEx) and Geuvadis projects. We quantitated tissue-specific and positional effects on nonsense-mediated transcript decay and present an improved predictive model for this decay. We directly measured the effect of variants both proximal and distal to splice junctions. Furthermore, we found that robustness to heterozygous gene inactivation is not due to dosage compensation. Our results illustrate the value of transcriptome data in the functional interpretation of genetic variants.
MOTIVATION: RNA sequencing enables allele-specific expression (ASE) studies that complement standard genotype expression studies for common variants and, importantly, also allow measuring the regulatory impact of rare variants. The Genotype-Tissue Expression (GTEx) project is collecting RNA-seq data on multiple tissues of the same set of individuals and novel methods are required for the analysis of these data. RESULTS: We present a statistical method to compare different patterns of ASE across tissues and to classify genetic variants according to their impact on the tissue-wide expression profile. We focus on strong ASE effects that we expect to see for protein-truncating variants, but our method can also be adjusted for other types of ASE effects. We illustrate the method with a real data example on a tissue-wide expression profile of a variant causal for lipoid proteinosis, and with a simulation study to assess our method more generally.
Fine-scale genetic variation between human populations is interesting as a signature of historical demographic events and because of its potential for confounding disease studies. We use haplotype-based statistical methods to analyse genome-wide single nucleotide polymorphism (SNP) data from a carefully chosen geographically diverse sample of 2,039 individuals from the United Kingdom. This reveals a rich and detailed pattern of genetic differentiation with remarkable concordance between genetic clusters and geography. The regional genetic differentiation and differing patterns of shared ancestry with 6,209 individuals from across Europe carry clear signals of historical demographic events. We estimate the genetic contribution to southeastern England from Anglo-Saxon migrations to be under half, and identify the regions not carrying genetic material from these migrations. We suggest significant pre-Roman but post-Mesolithic movement into southeastern England from continental Europe, and show that in non-Saxon parts of the United Kingdom, there exist genetically differentiated subgroups rather than a general 'Celtic' population.
Several studies have reported that the number of crossovers increases with maternal age in humans, but others have found the opposite. Resolving the true effect has implications for understanding the maternal age effect on aneuploidies. Here, we revisit this question in the largest sample to date using single nucleotide polymorphism (SNP)-chip data, comprising over 6,000 meioses from nine cohorts. We develop and fit a hierarchical model to allow for differences between cohorts and between mothers. We estimate that over 10 years, the expected number of maternal crossovers increases by 2.1% (95% credible interval (0.98%, 3.3%)). Our results are not consistent with the larger positive and negative effects previously reported in smaller cohorts. We see heterogeneity between cohorts that is likely due to chance effects in smaller samples, or possibly to confounders, emphasizing that care should be taken when interpreting results from any specific cohort about the effect of maternal age on recombination.
Association studies have greatly refined the understanding of how variation within the human leukocyte antigen (HLA) genes influences risk of multiple sclerosis. However, the extent to which major effects are modulated by interactions is poorly characterized. We analyzed high-density SNP data on 17,465 cases and 30,385 controls from 11 cohorts of European ancestry, in combination with imputation of classical HLA alleles, to build a high-resolution map of HLA genetic risk and assess the evidence for interactions involving classical HLA alleles. Among new and previously identified class II risk alleles (HLA-DRB1*15:01, HLA-DRB1*13:03, HLA-DRB1*03:01, HLA-DRB1*08:01 and HLA-DQB1*03:02) and class I protective alleles (HLA-A*02:01, HLA-B*44:02, HLA-B*38:01 and HLA-B*55:01), we find evidence for two interactions involving pairs of class II alleles: HLA-DQA1*01:01-HLA-DRB1*15:01 and HLA-DQB1*03:01-HLA-DQB1*03:02. We find no evidence for interactions between classical HLA alleles and non-HLA risk-associated variants and estimate a minimal effect of polygenic epistasis in modulating major risk alleles.
Eur J Hum Genet, 23 (9), pp. 1113-1115. 2015. Reply to Pembrey et al: 'ZNF277 microdeletions, specific language impairment and the meiotic mismatch methylation (3M) hypothesis'.
Myocardial infarction (MI), a leading cause of death around the world, displays a complex pattern of inheritance. When MI occurs early in life, genetic inheritance is a major component to risk. Previously, rare mutations in low-density lipoprotein (LDL) genes have been shown to contribute to MI risk in individual families, whereas common variants at more than 45 loci have been associated with MI risk in the population. Here we evaluate how rare mutations contribute to early-onset MI risk in the population. We sequenced the protein-coding regions of 9,793 genomes from patients with MI at an early age (≤50 years in males and ≤60 years in females) along with MI-free controls. We identified two genes in which rare coding-sequence mutations were more frequent in MI cases versus controls at exome-wide significance. At low-density lipoprotein receptor (LDLR), carriers of rare non-synonymous mutations were at 4.2-fold increased risk for MI; carriers of null alleles at LDLR were at even higher risk (13-fold difference). Approximately 2% of early MI cases harbour a rare, damaging mutation in LDLR; this estimate is similar to one made more than 40 years ago using an analysis of total cholesterol. Among controls, about 1 in 217 carried an LDLR coding-sequence mutation and had plasma LDL cholesterol > 190 mg dl⁻¹. At apolipoprotein A-V (APOA5), carriers of rare non-synonymous mutations were at 2.2-fold increased risk for MI. When compared with non-carriers, LDLR mutation carriers had higher plasma LDL cholesterol, whereas APOA5 mutation carriers had higher plasma triglycerides. Recent evidence has connected MI risk with coding-sequence mutations at two genes functionally related to APOA5, namely lipoprotein lipase and apolipoprotein C-III (refs 18, 19). Combined, these observations suggest that, as well as LDL cholesterol, disordered metabolism of triglyceride-rich lipoproteins contributes to MI risk.
Specific language impairment (SLI), an unexpected failure to develop appropriate language skills despite adequate non-verbal intelligence, is a heterogeneous multifactorial disorder with a complex genetic basis. We identified a homozygous microdeletion of 21,379 bp in the ZNF277 gene (NM_021994.2), encompassing exon 5, in an individual with severe receptive and expressive language impairment. The microdeletion was not found in the proband's affected sister or her brother who had mild language impairment. However, it was inherited from both parents, each of whom carries a heterozygous microdeletion and has a history of language problems. The microdeletion falls within the AUTS1 locus, a region linked to autistic spectrum disorders (ASDs). Moreover, ZNF277 is adjacent to the DOCK4 and IMMP2L genes, which have been implicated in ASD. We screened for the presence of ZNF277 microdeletions in cohorts of children with SLI or ASD and panels of control subjects. ZNF277 microdeletions were at an increased allelic frequency in SLI probands (1.1%) compared with both ASD family members (0.3%) and independent controls (0.4%). We performed quantitative RT-PCR analyses of the expression of IMMP2L, DOCK4 and ZNF277 in individuals carrying either an IMMP2L_DOCK4 microdeletion or a ZNF277 microdeletion. Although ZNF277 microdeletions reduce the expression of ZNF277, they do not alter the levels of DOCK4 or IMMP2L transcripts. Conversely, IMMP2L_DOCK4 microdeletions do not affect the expression levels of ZNF277. We postulate that ZNF277 microdeletions may contribute to the risk of language impairments in a manner that is independent of the autism risk loci previously described in this region.
Transgenic Research, 23 (5), pp. 854-854. 2014. Reprogramming meiotic recombination in the mouse.
NDM-producing Klebsiella pneumoniae strains represent major clinical and infection control challenges, particularly in resource-limited settings with high rates of antimicrobial resistance. Determining whether transmission occurs at a gene, plasmid, or bacterial strain level and within hospital and/or the community has implications for monitoring and controlling spread. Whole-genome sequencing (WGS) is the highest-resolution typing method available for transmission epidemiology. We sequenced carbapenem-resistant K. pneumoniae isolates from 26 individuals involved in several infection case clusters in a Nepali neonatal unit and 68 other clinical Gram-negative isolates from a similar time frame, using Illumina and PacBio technologies. Within-outbreak chromosomal and closed-plasmid structures were generated and used as data set-specific references. Three temporally separated case clusters were caused by a single NDM K. pneumoniae strain with a conserved set of four plasmids, one being a 304,526-bp plasmid carrying bla(NDM-1). The plasmids contained a large number of antimicrobial/heavy metal resistance and plasmid maintenance genes, which may have explained their persistence. No obvious environmental/human reservoir was found. There was no evidence of transmission of outbreak plasmids to other Gram-negative clinical isolates, although bla(NDM) variants were present in other isolates in different genetic contexts. WGS can effectively define complex antimicrobial resistance epidemiology. Wider sampling frames are required to contextualize outbreaks. Infection control may be effective in terminating outbreaks caused by particular strains, even in areas with widespread resistance, although this study could not demonstrate evidence supporting specific interventions. Larger, detailed studies are needed to characterize resistance genes, vectors, and host strains involved in disease, to enable effective intervention.
The pseudoautosomal region (PAR) is a short region of homology between the mammalian X and Y chromosomes, which has undergone rapid evolution. A crossover in the PAR is essential for the proper disjunction of X and Y chromosomes in male meiosis, and PAR deletion results in male sterility. As a result, PAR1, the human PAR that carries this obligatory crossover, has an exceptionally high male crossover rate, 17-fold higher than the genome-wide average. However, the mechanism by which this obligatory crossover occurs remains unknown, as does the fine-scale positioning of crossovers across this region. Recent research in mice has suggested that crossovers in PAR may be mediated independently of the protein PRDM9, which localises virtually all crossovers in the autosomes. To investigate recombination in this region, we construct the most fine-scale genetic map containing directly observed crossovers to date using African-American pedigrees. We leverage recombination rates inferred from the breakdown of linkage disequilibrium in human populations and investigate the signatures of DNA evolution due to recombination. Further, we identify direct PRDM9 binding sites using ChIP-seq in human cells. Using these independent lines of evidence, we show that, in contrast with mouse, PRDM9 does localise peaks of recombination in the human PAR1. We find that recombination is a far more rapid and intense driver of sequence evolution in PAR1 than it is on the autosomes. We also show that PAR1 hotspot activities differ significantly among human populations. Finally, we find evidence that PAR1 hotspot positions have changed between human and chimpanzee, with no evidence of sharing among the hottest hotspots. We anticipate that the genetic maps built and validated in this work will aid research on this vital and fascinating region of the genome.
A major use of the 1000 Genomes Project (1000 GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome. We then phase the sequence data 'onto' this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000 GP haplotype reference set for use by the human genetic community. Using a set of validation genotypes at SNP and bi-allelic indels we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants.
BACKGROUND: Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail. METHODS: This paper quantifies the extent of differences in annotation of 80 million variants from a whole-genome sequencing study. We compare results using the RefSeq and Ensembl transcript sets as the basis for variant annotation with the software Annovar, and also compare the results from two annotation software packages, Annovar and VEP (Ensembl's Variant Effect Predictor), when using Ensembl transcripts. RESULTS: We found only 44% agreement in annotations for putative loss-of-function variants when using the RefSeq and Ensembl transcript sets as the basis for annotation with Annovar. The rate of matching annotations for loss-of-function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. When comparing results from Annovar and VEP using Ensembl transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy. Using these comparisons, we characterised the types of apparent errors made by Annovar and VEP and discuss their impact on the analysis of DNA variants in genome sequencing studies. CONCLUSIONS: Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole-genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.
BACKGROUND: Genome-wide association studies (GWAS) have identified several loci associated with schizophrenia and/or bipolar disorder. We performed a GWAS of psychosis as a broad syndrome rather than within specific diagnostic categories. METHODS: 1239 cases with schizophrenia, schizoaffective disorder, or psychotic bipolar disorder; 857 of their unaffected relatives, and 2739 healthy controls were genotyped with the Affymetrix 6.0 single nucleotide polymorphism (SNP) array. Analyses of 695,193 SNPs were conducted using UNPHASED, which combines information across families and unrelated individuals. We attempted to replicate signals found in 23 genomic regions using existing data on nonoverlapping samples from the Psychiatric GWAS Consortium and Schizophrenia-GENE-plus cohorts (10,352 schizophrenia patients and 24,474 controls). RESULTS: No individual SNP showed compelling evidence for association with psychosis in our data. However, we observed a trend for association with the same risk alleles at loci previously associated with schizophrenia (one-sided p = .003). A polygenic score analysis found that the Psychiatric GWAS Consortium's panel of SNPs associated with schizophrenia significantly predicted disease status in our sample (p = 5 × 10⁻¹⁴) and explained approximately 2% of the phenotypic variance. CONCLUSIONS: Although narrowly defined phenotypes have their advantages, we believe new loci may also be discovered through meta-analysis across broad phenotypes. The novel statistical methodology we introduced to model effect size heterogeneity between studies should help future GWAS that combine association evidence from related phenotypes. Applying these approaches, we highlight three loci that warrant further investigation. We found that SNPs conveying risk for schizophrenia are also predictive of disease status in our data.
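The polygenic score analysis mentioned in the abstract above can be pictured, in its simplest form, as a weighted sum of risk-allele counts. The sketch below is a generic illustration and not the authors' pipeline: the SNP identifiers, weights (taken here to be log odds ratios from a discovery study) and genotypes are all hypothetical, and treating a missing genotype as zero copies is a simplifying assumption.

```python
def polygenic_score(dosages: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per SNP).

    SNPs missing from `dosages` are treated as 0 copies (an assumption made
    for brevity; real analyses usually impute or rescale instead).
    """
    return sum(weight * dosages.get(snp, 0.0) for snp, weight in weights.items())


# Hypothetical per-SNP weights, e.g. log odds ratios from a discovery GWAS.
weights = {"rs_hyp_1": 0.10, "rs_hyp_2": -0.05, "rs_hyp_3": 0.20}

# Hypothetical genotypes for one individual (copies of the effect allele).
person = {"rs_hyp_1": 2, "rs_hyp_2": 1, "rs_hyp_3": 0}

score = polygenic_score(person, weights)
print(f"polygenic score = {score:.2f}")
```

In a study like the one described, scores of this kind would then be compared between cases and controls, for example by regressing disease status on the score.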
To further understanding of the genetic basis of type 2 diabetes (T2D) susceptibility, we aggregated published meta-analyses of genome-wide association studies (GWAS), including 26,488 cases and 83,964 controls of European, east Asian, south Asian and Mexican and Mexican American ancestry. We observed a significant excess in the directional consistency of T2D risk alleles across ancestry groups, even at SNPs demonstrating only weak evidence of association. By following up the strongest signals of association from the trans-ethnic meta-analysis in an additional 21,491 cases and 55,647 controls of European ancestry, we identified seven new T2D susceptibility loci. Furthermore, we observed considerable improvements in the fine-mapping resolution of common variant association signals at several T2D susceptibility loci. These observations highlight the benefits of trans-ethnic GWAS for the discovery and characterization of complex trait loci and emphasize an exciting opportunity to extend insight into the genetic architecture and pathogenesis of human diseases across populations of diverse ancestry.
In severe early-onset epilepsy, precise clinical and molecular genetic diagnosis is complex, as many metabolic and electro-physiological processes have been implicated in disease causation. The clinical phenotypes share many features such as complex seizure types and developmental delay. Molecular diagnosis has historically been confined to sequential testing of candidate genes known to be associated with specific sub-phenotypes, but the diagnostic yield of this approach can be low. We conducted whole-genome sequencing (WGS) on six patients with severe early-onset epilepsy who had previously been refractory to molecular diagnosis, and their parents. Four of these patients had a clinical diagnosis of Ohtahara Syndrome (OS) and two patients had severe non-syndromic early-onset epilepsy (NSEOE). In two OS cases, we found de novo non-synonymous mutations in the genes KCNQ2 and SCN2A. In a third OS case, WGS revealed paternal isodisomy for chromosome 9, leading to identification of the causal homozygous missense variant in KCNT1, which produced a substantial increase in potassium channel current. The fourth OS patient had a recessive mutation in PIGQ that led to exon skipping and defective glycophosphatidyl inositol biosynthesis. The two patients with NSEOE had likely pathogenic de novo mutations in CBL and CSNK1G1, respectively. Mutations in these genes were not found among 500 additional individuals with epilepsy. This work reveals two novel genes for OS, KCNT1 and PIGQ. It also uncovers unexpected genetic mechanisms and emphasizes the power of WGS as a clinical tool for making molecular diagnoses, particularly for highly heterogeneous disorders.
Bladder cancers are a leading cause of death from malignancy. Molecular markers might predict disease progression and behaviour more accurately than the available prognostic factors. Here we use whole-genome sequencing to identify somatic mutations and chromosomal changes in 14 bladder cancers of different grades and stages. As well as detecting the known bladder cancer driver mutations, we report the identification of recurrent protein-inactivating mutations in CDKN1A and FAT1. The former are not mutually exclusive with TP53 mutations or MDM2 amplification, showing that CDKN1A dysfunction is not simply an alternative mechanism for p53 pathway inactivation. We find strong positive associations between higher tumour stage/grade and greater clonal diversity, the number of somatic mutations and the burden of copy number changes. In principle, the identification of sub-clones with greater diversity and/or mutation burden within early-stage or low-grade tumours could identify lesions with a high risk of invasive progression.
Dissecting how genetic and environmental influences impact on learning is helpful for maximizing numeracy and literacy. Here we show, using twin and genome-wide analysis, that there is a substantial genetic component to children's ability in reading and mathematics, and estimate that around one half of the observed correlation in these traits is due to shared genetic effects (so-called Generalist Genes). Thus, our results highlight the potential role of the learning environment in contributing to differences in a child's cognitive abilities at age twelve.
To discover quantitative trait loci for intraocular pressure, a major risk factor for glaucoma and the only modifiable one, we performed a genome-wide association study on a discovery cohort of 2175 individuals from Sydney, Australia. We found a novel association between intraocular pressure and a common variant at 7p21 near to GLCCI1 and ICA1. The findings in this region were confirmed through two UK replication cohorts totalling 4866 individuals (rs59072263, P(combined) = 1.10 × 10(-8)). A copy of the G allele at this SNP is associated with an increase in mean IOP of 0.45 mmHg (95%CI = 0.30-0.61 mmHg). These results lend support to the implication of vesicle trafficking and glucocorticoid inducibility pathways in the determination of intraocular pressure and in the pathogenesis of primary open-angle glaucoma.
Using the ImmunoChip custom genotyping array, we analyzed 14,498 subjects with multiple sclerosis and 24,091 healthy controls for 161,311 autosomal variants and identified 135 potentially associated regions (P < 1.0 × 10(-4)). In a replication phase, we combined these data with previous genome-wide association study (GWAS) data from an independent 14,802 subjects with multiple sclerosis and 26,703 healthy controls. In these 80,094 individuals of European ancestry, we identified 48 new susceptibility variants (P < 5.0 × 10(-8)), 3 of which we found after conditioning on previously identified variants. Thus, there are now 110 established multiple sclerosis risk variants at 103 discrete loci outside of the major histocompatibility complex. With high-resolution Bayesian fine mapping, we identified five regions where one variant accounted for more than 50% of the posterior probability of association. This study enhances the catalog of multiple sclerosis risk variants and illustrates the value of fine mapping in the resolution of GWAS signals.
Schizophrenia is an idiopathic mental disorder with a heritable component and a substantial public health impact. We conducted a multi-stage genome-wide association study (GWAS) for schizophrenia beginning with a Swedish national sample (5,001 cases and 6,243 controls) followed by meta-analysis with previous schizophrenia GWAS (8,832 cases and 12,067 controls) and finally by replication of SNPs in 168 genomic regions in independent samples (7,413 cases, 19,762 controls and 581 parent-offspring trios). We identified 22 loci associated at genome-wide significance; 13 of these are new, and 1 was previously implicated in bipolar disorder. Examination of candidate genes at these loci suggests the involvement of neuronal calcium signaling. We estimate that 8,300 independent, mostly common SNPs (95% credible interval of 6,300-10,200 SNPs) contribute to risk for schizophrenia and that these collectively account for at least 32% of the variance in liability. Common genetic variation has an important role in the etiology of schizophrenia, and larger studies will allow more detailed understanding of this disorder.
MOTIVATION: In sequencing studies of common diseases and quantitative traits, power to test rare and low frequency variants individually is weak. To improve power, a common approach is to combine statistical evidence from several genetic variants in a region. Major challenges are how to do the combining and which statistical framework to use. General approaches for testing association between rare variants and quantitative traits include aggregating genotypes and trait values, referred to as 'collapsing', or using a score-based variance component test. However, little attention has been paid to alternative models tailored for protein truncating variants. Recent studies have highlighted the important role that protein truncating variants, commonly referred to as 'loss of function' variants, may have on disease susceptibility and quantitative levels of biomarkers. We propose a Bayesian modelling framework for the analysis of protein truncating variants and quantitative traits. RESULTS: Our simulation results show that our models have an advantage over the commonly used methods. We apply our models to sequence and exome-array data and discover strong evidence of association between low plasma triglyceride levels and protein truncating variants at APOC3 (Apolipoprotein C3). AVAILABILITY: Software is available from http://www.well.ox.ac.uk/~rivas/mamba
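The 'collapsing' approach mentioned above can be illustrated with a minimal sketch. This is the generic aggregation idea only, not the Bayesian model the abstract proposes; the function names, the 0/1/2 genotype coding and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def collapse(genotypes):
    """Collapse a region to one carrier indicator per individual.

    genotypes: list of rows, one per individual, each row holding the
    minor-allele count (0, 1 or 2) at every rare variant in the region.
    Returns 1 if the individual carries any rare allele, else 0.
    """
    return [1 if any(count > 0 for count in row) else 0 for row in genotypes]


def carrier_trait_difference(genotypes, trait):
    """Mean trait in carriers minus mean trait in non-carriers."""
    carriers = collapse(genotypes)
    carrier_values = [y for c, y in zip(carriers, trait) if c == 1]
    other_values = [y for c, y in zip(carriers, trait) if c == 0]
    if not carrier_values or not other_values:
        raise ValueError("need at least one carrier and one non-carrier")
    return mean(carrier_values) - mean(other_values)


# Hypothetical toy data: four individuals, three rare variants.
toy_genotypes = [[0, 0, 1], [0, 0, 0], [1, 0, 0], [0, 0, 0]]
toy_trait = [1.0, 2.0, 1.5, 2.5]
difference = carrier_trait_difference(toy_genotypes, toy_trait)
```

In practice the difference would be tested formally, for example by regressing the trait on the carrier indicator with covariates, which this sketch leaves out.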
Ankylosing spondylitis is a common, highly heritable inflammatory arthritis affecting primarily the spine and pelvis. In addition to HLA-B*27 alleles, 12 loci have previously been identified that are associated with ankylosing spondylitis in populations of European ancestry, and 2 associated loci have been identified in Asians. In this study, we used the Illumina Immunochip microarray to perform a case-control association study involving 10,619 individuals with ankylosing spondylitis (cases) and 15,145 controls. We identified 13 new risk loci and 12 additional ankylosing spondylitis-associated haplotypes at 11 loci. Two ankylosing spondylitis-associated regions have now been identified encoding four aminopeptidases that are involved in peptide processing before major histocompatibility complex (MHC) class I presentation. Protective variants at two of these loci are associated both with reduced aminopeptidase function and with MHC class I cell surface expression.
The congenital dyserythropoietic anemias are a heterogeneous group of rare disorders primarily affecting erythropoiesis with characteristic morphological abnormalities and a block in erythroid maturation. Mutations in the CDAN1 gene, which encodes Codanin-1, underlie the majority of congenital dyserythropoietic anemia type I cases. However, no likely pathogenic CDAN1 mutation has been detected in approximately 20% of cases, suggesting the presence of at least one other locus. We used whole genome sequencing and segregation analysis to identify a homozygous T to A transversion (c.533T>A), predicted to lead to a p.L178Q missense substitution in C15ORF41, a gene of unknown function, in a consanguineous pedigree of Middle-Eastern origin. Sequencing C15ORF41 in other CDAN1 mutation-negative congenital dyserythropoietic anemia type I pedigrees identified a homozygous transition (c.281A>G), predicted to lead to a p.Y94C substitution, in two further pedigrees of SouthEast Asian origin. The haplotype surrounding the c.281A>G change suggests a founder effect for this mutation in Pakistan. Detailed sequence similarity searches indicate that C15ORF41 encodes a novel restriction endonuclease that is a member of the Holliday junction resolvase family of proteins.
Though difficult, the study of gene-environment interactions in multifactorial diseases is crucial for interpreting the relevance of non-heritable factors and helps avoid overlooking genetic associations with small but measurable effects. We propose a "candidate interactome" (i.e. a group of genes whose products are known to physically interact with environmental factors that may be relevant for disease pathogenesis) analysis of genome-wide association data in multiple sclerosis. We looked for statistical enrichment of associations among interactomes that, at the current state of knowledge, may be representative of gene-environment interactions of potential, uncertain or unlikely relevance for multiple sclerosis pathogenesis: Epstein-Barr virus, human immunodeficiency virus, hepatitis B virus, hepatitis C virus, cytomegalovirus, HHV8-Kaposi sarcoma, H1N1-influenza, JC virus, human innate immunity interactome for type I interferon, autoimmune regulator, vitamin D receptor, aryl hydrocarbon receptor and a panel of proteins targeted by 70 innate immune-modulating viral open reading frames from 30 viral species. Interactomes were either obtained from the literature or were manually curated. The P values of all single nucleotide polymorphisms mapping to a given interactome were obtained from the most recent genome-wide association study of the International Multiple Sclerosis Genetics Consortium and the Wellcome Trust Case Control Consortium 2. The interaction between genotype and Epstein-Barr virus emerges as relevant for multiple sclerosis etiology. However, in line with recent data on the coexistence of common and unique strategies used by viruses to perturb the human molecular system, other viruses also have a similar potential, though they are probably less relevant in epidemiological terms.
To identify susceptibility loci for visceral leishmaniasis, we undertook genome-wide association studies in two populations: 989 cases and 1,089 controls from India and 357 cases in 308 Brazilian families (1,970 individuals). The HLA-DRB1-HLA-DQA1 locus was the only region to show strong evidence of association in both populations. Replication at this region was undertaken in a second Indian population comprising 941 cases and 990 controls, and combined analysis across the three cohorts for rs9271858 at this locus showed P(combined) = 2.76 × 10(-17) and odds ratio (OR) = 1.41, 95% confidence interval (CI) = 1.30-1.52. A conditional analysis provided evidence for multiple associations within the HLA-DRB1-HLA-DQA1 region, and a model in which risk differed between three groups of haplotypes better explained the signal and was significant in the Indian discovery and replication cohorts. In conclusion, the HLA-DRB1-HLA-DQA1 HLA class II region contributes to visceral leishmaniasis susceptibility in India and Brazil, suggesting shared genetic risk factors for visceral leishmaniasis that cross the epidemiological divides of geography and parasite species.
Improved sequencing technologies offer unprecedented opportunities for investigating the role of rare genetic variation in common disease. However, there are considerable challenges with respect to study design, data analysis and replication. Using pooled next-generation sequencing of 507 genes implicated in the repair of DNA in 1,150 samples, an analytical strategy focused on protein-truncating variants (PTVs) and a large-scale sequencing case-control replication experiment in 13,642 individuals, here we show that rare PTVs in the p53-inducible protein phosphatase PPM1D are associated with predisposition to breast cancer and ovarian cancer. PPM1D PTV mutations were present in 25 out of 7,781 cases versus 1 out of 5,861 controls (P = 1.12 × 10(-5)), including 18 mutations in 6,912 individuals with breast cancer (P = 2.42 × 10(-4)) and 12 mutations in 1,121 individuals with ovarian cancer (P = 3.10 × 10(-9)). Notably, all of the identified PPM1D PTVs were mosaic in lymphocyte DNA and clustered within a 370-base-pair region in the final exon of the gene, carboxy-terminal to the phosphatase catalytic domain. Functional studies demonstrate that the mutations result in enhanced suppression of p53 in response to ionizing radiation exposure, suggesting that the mutant alleles encode hyperactive PPM1D isoforms. Thus, although the mutations cause premature protein truncation, they do not result in the simple loss-of-function effect typically associated with this class of variant, but instead probably have a gain-of-function effect. Our results have implications for the detection and management of breast and ovarian cancer risk. More generally, these data provide new insights into the role of rare and of mosaic genetic variants in common conditions, and the use of sequencing in their identification.
Instances in which natural selection maintains genetic variation in a population over millions of years are thought to be extremely rare. We conducted a genome-wide scan for long-lived balancing selection by looking for combinations of SNPs shared between humans and chimpanzees. In addition to the major histocompatibility complex, we identified 125 regions in which the same haplotypes are segregating in the two species, all but two of which are noncoding. In six cases, there is evidence for an ancestral polymorphism that persisted to the present in humans and chimpanzees. Regions with shared haplotypes are significantly enriched for membrane glycoproteins, and a similar trend is seen among shared coding polymorphisms. These findings indicate that ancient balancing selection has shaped human variation and point to genes involved in host-pathogen interactions as common targets.
Congenital myasthenic syndromes are a heterogeneous group of inherited disorders that arise from impaired signal transmission at the neuromuscular synapse. They are characterized by fatigable muscle weakness. We performed linkage analysis, whole-exome and whole-genome sequencing to determine the underlying defect in patients with an inherited limb-girdle pattern of myasthenic weakness. We identify ALG14 and ALG2 as novel genes in which mutations cause a congenital myasthenic syndrome. Through analogy with yeast, ALG14 is thought to form a multiglycosyltransferase complex with ALG13 and DPAGT1 that catalyses the first two committed steps of asparagine-linked protein glycosylation. We show that ALG14 is concentrated at the muscle motor endplates and small interfering RNA silencing of ALG14 results in reduced cell-surface expression of muscle acetylcholine receptor expressed in human embryonic kidney 293 cells. ALG2 is an alpha-1,3-mannosyltransferase that also catalyses early steps in the asparagine-linked glycosylation pathway. Mutations were identified in two kinships, with mutation ALG2p.Val68Gly found to severely reduce ALG2 expression both in patient muscle, and in cell cultures. Identification of DPAGT1, ALG14 and ALG2 mutations as a cause of congenital myasthenic syndrome underscores the importance of asparagine-linked protein glycosylation for proper functioning of the neuromuscular junction. These syndromes form part of the wider spectrum of congenital disorders of glycosylation caused by impaired asparagine-linked glycosylation. It is likely that further genes encoding components of this pathway will be associated with congenital myasthenic syndromes or impaired neuromuscular transmission as part of a more severe multisystem disorder. Our findings suggest that treatment with cholinesterase inhibitors may improve muscle function in many of the congenital disorders of glycosylation. © 2013 The Author (2013). Published by Oxford University Press on behalf of the Guarantors of Brain.
Motivated by genome-wide association studies, we consider a standard linear model with one additional random effect in situations where many predictors have been collected on the same subjects and each predictor is analyzed separately. Three novel contributions are (1) a transformation between the linear and log-odds scales which is accurate for the important genetic case of small effect sizes; (2) a likelihood-maximization algorithm that is an order of magnitude faster than the previously published approaches; and (3) efficient methods for computing marginal likelihoods which allow Bayesian model comparison. The methodology has been successfully applied to a large-scale association study of multiple sclerosis including over 20,000 individuals and 500,000 genetic variants. © 2013 Institute of Mathematical Statistics.
Int Microbiol, 16 (2), pp. 125-132.
2013. Accessibility, sustainability, excellence: how to expand access to research publications. Executive summary.
Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project - the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome. © 2013 Macmillan Publishers Limited. All rights reserved.
Many individuals with multiple or large colorectal adenomas or early-onset colorectal cancer (CRC) have no detectable germline mutations in the known cancer predisposition genes. Using whole-genome sequencing, supplemented by linkage and association analysis, we identified specific heterozygous POLE or POLD1 germline variants in several multiple-adenoma and/or CRC cases but in no controls. The variants associated with susceptibility, POLE p.Leu424Val and POLD1 p.Ser478Asn, have high penetrance, and POLD1 mutation was also associated with endometrial cancer predisposition. The mutations map to equivalent sites in the proofreading (exonuclease) domain of DNA polymerases ɛ and δ and are predicted to cause a defect in the correction of mispaired bases inserted during DNA replication. In agreement with this prediction, the tumors from mutation carriers were microsatellite stable but tended to acquire base substitution mutations, as confirmed by yeast functional assays. Further analysis of published data showed that the recently described group of hypermutant, microsatellite-stable CRCs is likely to be caused by somatic POLE mutations affecting the exonuclease domain.
β-III spectrin is present in the brain and is known to be important in the function of the cerebellum. Heterozygous mutations in SPTBN2, the gene encoding β-III spectrin, cause Spinocerebellar Ataxia Type 5 (SCA5), an adult-onset, slowly progressive, autosomal-dominant pure cerebellar ataxia. SCA5 is sometimes known as "Lincoln ataxia," because the largest known family is descended from relatives of the United States President Abraham Lincoln. Using targeted capture and next-generation sequencing, we identified a homozygous stop codon in SPTBN2 in a consanguineous family in which childhood developmental ataxia co-segregates with cognitive impairment. The cognitive impairment could result from mutations in a second gene, but further analysis using whole-genome sequencing combined with SNP array analysis did not reveal any evidence of other mutations. We also examined a mouse knockout of β-III spectrin in which ataxia and progressive degeneration of cerebellar Purkinje cells has been previously reported and found morphological abnormalities in neurons from prefrontal cortex and deficits in object recognition tasks, consistent with the human cognitive phenotype. These data provide the first evidence that β-III spectrin plays an important role in cortical brain development and cognition, in addition to its function in the cerebellum; and we conclude that cognitive impairment is an integral part of this novel recessive ataxic syndrome, Spectrin-associated Autosomal Recessive Cerebellar Ataxia type 1 (SPARCA1). In addition, the identification of SPARCA1 and normal heterozygous carriers of the stop codon in SPTBN2 provides insights into the mechanism of molecular dominance in SCA5 and demonstrates that the cell-specific repertoire of spectrin subunits underlies a novel group of disorders, the neuronal spectrinopathies, which includes SCA5, SPARCA1, and a form of West syndrome.
To gain further insight into the genetic architecture of psoriasis, we conducted a meta-analysis of 3 genome-wide association studies (GWAS) and 2 independent data sets genotyped on the Immunochip, including 10,588 cases and 22,806 controls. We identified 15 new susceptibility loci, increasing to 36 the number associated with psoriasis in European individuals. We also identified, using conditional analyses, five independent signals within previously known loci. The newly identified loci shared with other autoimmune diseases include candidate genes with roles in regulating T-cell function (such as RUNX3, TAGAP and STAT3). Notably, they included candidate genes whose products are involved in innate host defense, including interferon-mediated antiviral responses (DDX58), macrophage activation (ZC3H12C) and nuclear factor (NF)-κB signaling (CARD14 and CARM1). These results portend a better understanding of shared and distinctive genetic determinants of immune-mediated inflammatory disorders and emphasize the importance of the skin in innate and acquired host defense.
To further investigate susceptibility loci identified by genome-wide association studies, we genotyped 5,500 SNPs across 14 associated regions in 8,000 samples from a control group and 3 diseases: type 2 diabetes (T2D), coronary artery disease (CAD) and Graves' disease. We defined, using Bayes theorem, credible sets of SNPs that were 95% likely, based on posterior probability, to contain the causal disease-associated SNPs. In 3 of the 14 regions, TCF7L2 (T2D), CTLA4 (Graves' disease) and CDKN2A-CDKN2B (T2D), much of the posterior probability rested on a single SNP, and, in 4 other regions (CDKN2A-CDKN2B (CAD) and CDKAL1, FTO and HHEX (T2D)), the 95% sets were small, thereby excluding most SNPs as potentially causal. Very few SNPs in our credible sets had annotated functions, illustrating the limitations in understanding the mechanisms underlying susceptibility to common diseases. Our results also show the value of more detailed mapping to target sequences for functional studies.
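The credible-set construction described above can be sketched as follows. This is an illustrative reading rather than the study's own code: it assumes exactly one causal SNP per region and equal prior probability for every SNP, so that each SNP's posterior is proportional to its Bayes factor. The function name and the example log Bayes factors are hypothetical.

```python
import math


def credible_set(log_bayes_factors, coverage=0.95):
    """Return (indices, posteriors) for the smallest set reaching `coverage`.

    Assumes a single causal SNP in the region and a flat prior across
    SNPs, so posterior_i = BF_i / sum_j BF_j (Bayes' theorem).
    """
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(log_bayes_factors)
    weights = [math.exp(lbf - largest) for lbf in log_bayes_factors]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    # Add SNPs in decreasing posterior order until coverage is reached.
    order = sorted(range(len(posteriors)),
                   key=lambda i: posteriors[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += posteriors[i]
        if cumulative >= coverage:
            break
    return chosen, posteriors


# Hypothetical region in which one SNP dominates the evidence.
snps, post = credible_set([8.0, 2.0, 1.0, 0.5, 0.0])
```

With these made-up inputs the first SNP alone carries more than 99% of the posterior, so the 95% set contains a single SNP; when the evidence is spread evenly the set grows to include most of the region.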
BACKGROUND: We performed a genome-wide association study (GWAS) to identify common risk variants for schizophrenia. METHODS: The discovery scan included 1606 patients and 1794 controls from Ireland, using 6,212,339 directly genotyped or imputed single nucleotide polymorphisms (SNPs). A subset of this sample (270 cases and 860 controls) was subsequently included in the Psychiatric GWAS Consortium-schizophrenia GWAS meta-analysis. RESULTS: One hundred eight SNPs were taken forward for replication in an independent sample of 13,195 cases and 31,021 control subjects. The most significant associations, corrected for genomic inflation, in discovery (rs204999, p combined = 1.34 × 10(-9)) and in combined samples (rs2523722, p combined = 2.88 × 10(-16)) mapped to the major histocompatibility complex (MHC) region. We imputed classical human leukocyte antigen (HLA) alleles at the locus; the most significant finding was with HLA-C*01:02. This association was distinct from the top SNP signal. The HLA alleles DRB1*03:01 and B*08:01 were protective, replicating a previous study. CONCLUSIONS: This study provides further support for involvement of MHC class I molecules in schizophrenia. We found evidence of association with previously reported risk alleles at the TCF4, VRK2, and ZNF804A loci.
Barrett's esophagus is an increasingly common disease that is strongly associated with reflux of stomach acid and usually a hiatus hernia, and it strongly predisposes to esophageal adenocarcinoma (EAC), a tumor with a very poor prognosis. We report the first genome-wide association study on Barrett's esophagus, comprising 1,852 UK cases and 5,172 UK controls in the discovery stage and 5,986 cases and 12,825 controls in the replication stage. Variants at two loci were associated with disease risk: chromosome 6p21, rs9257809 (Pcombined=4.09×10(-9); odds ratio (OR)=1.21, 95% confidence interval (CI)=1.13-1.28), within the major histocompatibility complex locus, and chromosome 16q24, rs9936833 (Pcombined=2.74×10(-10); OR=1.14, 95% CI=1.10-1.19), for which the closest protein-coding gene is FOXF1, which is implicated in esophageal development and structure. We found evidence that many common variants of small effect contribute to genetic susceptibility to Barrett's esophagus and that SNP alleles predisposing to obesity also increase risk for Barrett's esophagus.
To extend understanding of the genetic architecture and molecular basis of type 2 diabetes (T2D), we conducted a meta-analysis of genetic variants on the Metabochip, including 34,840 cases and 114,981 controls, overwhelmingly of European descent. We identified ten previously unreported T2D susceptibility loci, including two showing sex-differentiated association. Genome-wide analyses of these data are consistent with a long tail of additional common variant loci explaining much of the variation in susceptibility to T2D. Exploration of the enlarged set of susceptibility loci implicates several processes, including CREBBP-related transcription, adipocytokine signaling and cell cycle regulation, in diabetes pathogenesis.
Nat Genet, 44 (4), pp. 361-362.
2012. The role of ATM in response to metformin treatment and activation of AMPK.
Streptococcus pneumoniae ('pneumococcus') causes an estimated 14.5 million cases of serious disease and 826,000 deaths annually in children under 5 years of age(1). The highly effective introduction of the PCV7 pneumococcal vaccine in 2000 in the United States(2,3) provided an unprecedented opportunity to investigate the response of an important pathogen to widespread, vaccine-induced selective pressure. Here, we use array-based sequencing of 62 isolates from a US national monitoring program to study five independent instances of vaccine escape recombination(4), showing the simultaneous transfer of multiple and often large (up to at least 44 kb) DNA fragments. We show that one such new strain quickly became established, spreading from east to west across the United States. These observations clarify the roles of recombination and selection in the population genomics of pneumococcus and provide proof of principle of the considerable value of combining genomic and epidemiological information in the surveillance and enhanced understanding of infectious diseases.
In spite of its evolutionary significance and conservation importance, the population structure of the common chimpanzee, Pan troglodytes, is still poorly understood. An issue of particular controversy is whether the proposed fourth subspecies of chimpanzee, Pan troglodytes ellioti, from parts of Nigeria and Cameroon, is genetically distinct. Although modern high-throughput SNP genotyping has had a major impact on our understanding of human population structure and demographic history, its application to ecological, demographic, or conservation questions in non-human species has been extremely limited. Here we apply these tools to chimpanzee population structure, using ∼700 autosomal SNPs derived from chimpanzee genomic data and a further ∼100 SNPs from targeted re-sequencing. We demonstrate conclusively the existence of P. t. ellioti as a genetically distinct subgroup. We show that there is clear differentiation between the verus, troglodytes, and ellioti populations at the SNP and haplotype level, on a scale that is greater than that separating continental human populations. Further, we show that only a small set of SNPs (10-20) is needed to successfully assign individuals to these populations. Tellingly, use of only mitochondrial DNA variation to classify individuals is erroneous in 4 of 54 cases, reinforcing the dangers of basing demographic inference on a single locus and implying that the demographic history of the species is more complicated than that suggested by analyses based solely on mtDNA. In this study we demonstrate the feasibility of developing economical and robust tests of individual chimpanzee origin as well as in-depth studies of population structure. These findings have important implications for conservation strategies and our understanding of the evolution of chimpanzees. They also act as a proof-of-principle for the use of cheap high-throughput genomic methods for ecological questions.
Genetic factors have been implicated in stroke risk, but few replicated associations have been reported. We conducted a genome-wide association study (GWAS) for ischemic stroke and its subtypes in 3,548 affected individuals and 5,972 controls, all of European ancestry. Replication of potential signals was performed in 5,859 affected individuals and 6,281 controls. We replicated previous associations for cardioembolic stroke near PITX2 and ZFHX3 and for large vessel stroke at a 9p21 locus. We identified a new association for large vessel stroke within HDAC9 (encoding histone deacetylase 9) on chromosome 7p21.1 (including further replication in an additional 735 affected individuals and 28,583 controls) (rs11984041; combined P = 1.87 × 10(-11); odds ratio (OR) = 1.42, 95% confidence interval (CI) = 1.28-1.57). All four loci exhibited evidence for heterogeneity of effect across the stroke subtypes, with some and possibly all affecting risk for only one subtype. This suggests distinct genetic architectures for different stroke subtypes.
SUMMARY: High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotype frequencies across individuals. The complexities of the sample collection process and the potential for errors in the experimental assay can lead to biases and artefacts in an individual's inferred genotypes. Rather than attempting to model these complications, it has become a standard practice to remove individuals whose genome-wide data differ from the sample at large. Here we describe a simple, but robust, statistical algorithm to identify samples with atypical summaries of genome-wide variation. Its use as a semi-automated quality control tool is demonstrated using several summary statistics, selected to identify different potential problems, and it is applied to two different genotyping platforms and sample collections. AVAILABILITY: The algorithm is written in R and is freely available at www.well.ox.ac.uk/chris-spencer CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
There is a great deal of interest in fine-scale population structure in the UK, both as a signature of historical immigration events and because of the effect population structure may have on disease association studies. Although population structure appears to have a minor impact on the current generation of genome-wide association studies, it is likely to play a significant part in the next generation of studies designed to search for rare variants. A powerful way of detecting such structure is to control and document carefully the provenance of the samples involved. In this study, we describe the collection of a cohort of rural UK samples (The People of the British Isles), aimed at providing a well-characterised UK-control population that can be used as a resource by the research community, as well as providing fine-scale genetic information on the British population. So far, some 4000 samples have been collected, the majority of which fit the criteria of coming from a rural area and having all four grandparents from approximately the same area. Analysis of the first 3865 samples that have been geocoded indicates that 75% have a mean distance between grandparental places of birth of 37.3 km, and that about 70% of grandparental places of birth can be classed as rural. Preliminary genotyping of 1057 samples demonstrates the value of these samples for investigating fine-scale population structure within the UK, and shows how this can be enhanced by the use of surnames. © 2012 Macmillan Publishers Limited All rights reserved.
There have been few definitive examples of gene-gene interactions in humans. Through mutational analyses in 7325 individuals, we report four interactions (defined as departures from a multiplicative model) between mutations in the breast cancer susceptibility genes ATM and CHEK2 with BRCA1 and BRCA2 (case-only interaction between ATM and BRCA1/BRCA2 combined, P = 5.9 × 10⁻⁴; ATM and BRCA1, P = 0.01; ATM and BRCA2, P = 0.02; CHEK2 and BRCA1/BRCA2 combined, P = 2.1 × 10⁻⁴; CHEK2 and BRCA1, P = 0.01; CHEK2 and BRCA2, P = 0.01). The interactions are such that the resultant risk of breast cancer is lower than the multiplicative product of the constituent risks, and plausibly reflect the functional relationships of the encoded proteins in DNA repair. These findings have important implications for models of disease predisposition and clinical translation.
Streptococcus pneumoniae ('pneumococcus') causes an estimated 14.5 million cases of serious disease and 826,000 deaths annually in children under 5 years of age. The highly effective introduction of the PCV7 pneumococcal vaccine in 2000 in the United States provided an unprecedented opportunity to investigate the response of an important pathogen to widespread, vaccine-induced selective pressure. Here, we use array-based sequencing of 62 isolates from a US national monitoring program to study five independent instances of vaccine escape recombination, showing the simultaneous transfer of multiple and often large (up to at least 44 kb) DNA fragments. We show that one such new strain quickly became established, spreading from east to west across the United States. These observations clarify the roles of recombination and selection in the population genomics of pneumococcus and provide proof of principle of the considerable value of combining genomic and epidemiological information in the surveillance and enhanced understanding of infectious diseases. © 2012 Nature America, Inc. All rights reserved.
Nature Genetics, 44 (3), pp. 328-333. | Citations: 1 (Scopus) | 2012. Genome-wide association study identifies a variant in HDAC9 associated with large vessel ischemic stroke
Nature Genetics, 44 (4), pp. 361-362. | Citations: 11 (Scopus) | 2012. Zhou et al. reply
To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high-throughput sequence data from 10 Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with those of humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees, and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.
OBJECTIVES: To investigate the prospects of newly available benchtop sequencers to provide rapid whole-genome data in routine clinical practice. Next-generation sequencing has the potential to resolve uncertainties surrounding the route and timing of person-to-person transmission of healthcare-associated infection, which has been a major impediment to optimal management. DESIGN: The authors used Illumina MiSeq benchtop sequencing to undertake case studies investigating potential outbreaks of methicillin-resistant Staphylococcus aureus (MRSA) and Clostridium difficile. SETTING: Isolates were obtained from potential outbreaks associated with three UK hospitals. PARTICIPANTS: Isolates were sequenced from a cluster of eight MRSA carriers and an associated bacteraemia case in an intensive care unit, another MRSA cluster of six cases and two clusters of C difficile. Additionally, all C difficile isolates from cases over 6 weeks in a single hospital were rapidly sequenced and compared with local strain sequences obtained in the preceding 3 years. MAIN OUTCOME MEASURE: Whole-genome genetic relatedness of the isolates within each epidemiological cluster. RESULTS: Twenty-six MRSA and 15 C difficile isolates were successfully sequenced and analysed within 5 days of culture. Both MRSA clusters were identified as outbreaks, with most sequences in each cluster indistinguishable and all within three single nucleotide variants (SNVs). Epidemiologically unrelated isolates of the same spa-type were genetically distinct (≥21 SNVs). In both C difficile clusters, closely epidemiologically linked cases (in one case sharing the same strain type) were shown to be genetically distinct (≥144 SNVs). A reconstruction applying rapid sequencing in C difficile surveillance provided early outbreak detection and identified previously undetected probable community transmission. CONCLUSIONS: This benchtop sequencing technology is widely generalisable to human bacterial pathogens. 
The findings provide several good examples of how rapid and precise sequencing could transform identification of transmission of healthcare-associated infection and therefore improve hospital infection control and patient outcomes in routine clinical practice.
Whole-genome sequencing offers new insights into the evolution of bacterial pathogens and the etiology of bacterial disease. Staphylococcus aureus is a major cause of bacteria-associated mortality and invasive disease and is carried asymptomatically by 27% of adults. Eighty percent of bacteremias match the carried strain. However, the role of evolutionary change in the pathogen during the progression from carriage to disease is incompletely understood. Here we use high-throughput genome sequencing to discover the genetic changes that accompany the transition from nasal carriage to fatal bloodstream infection in an individual colonized with methicillin-sensitive S. aureus. We found a single, cohesive population exhibiting a repertoire of 30 single-nucleotide polymorphisms and four insertion/deletion variants. Mutations accumulated at a steady rate over a 13-mo period, except for a cluster of mutations preceding the transition to disease. Although bloodstream bacteria differed by just eight mutations from the original nasally carried bacteria, half of those mutations caused truncation of proteins, including a premature stop codon in an AraC-family transcriptional regulator that has been implicated in pathogenicity. Comparison with evolution in two asymptomatic carriers supported the conclusion that clusters of protein-truncating mutations are highly unusual. Our results demonstrate that bacterial diversity in vivo is limited but nonetheless detectable by whole-genome sequencing, enabling the study of evolutionary dynamics within the host. Regulatory or structural changes that occur during carriage may be functionally important for pathogenesis; therefore identifying those changes is a crucial step in understanding the biological causes of invasive bacterial disease.
Genome-wide association studies (GWAS) search for associations between genetic variants and disease status, typically via logistic regression. Often there are covariates, such as sex or well-established major genetic factors, that are known to affect disease susceptibility and are independent of tested genotypes at the population level. We show theoretically and with data from recent GWAS on multiple sclerosis, psoriasis and ankylosing spondylitis that inclusion of known covariates can substantially reduce power for the identification of associated variants when the disease prevalence is lower than a few percent. Whether the inclusion of such covariates reduces or increases power to detect genetic effects depends on various factors, including the prevalence of the disease studied. When the disease is common (prevalence of >20%), the inclusion of covariates typically increases power, whereas, for rarer diseases, it can often decrease power to detect new genetic associations.
Genome-wide association studies (GWASs) have been successful at identifying single-nucleotide polymorphisms (SNPs) highly associated with common traits; however, a great deal of the heritable variation associated with common traits remains unaccounted for within the genome. Genome-wide complex trait analysis (GCTA) is a statistical method that applies a linear mixed model to estimate phenotypic variance of complex traits explained by genome-wide SNPs, including those not associated with the trait in a GWAS. We applied GCTA to 8 cohorts containing 7,096 case and 19,455 control individuals of European ancestry in order to examine the missing heritability present in Parkinson's disease (PD). We meta-analyzed our initial results to produce robust heritability estimates for PD types across cohorts. Our results identify 27% (95% CI 17-38, P = 8.08 × 10⁻⁸) phenotypic variance associated with all types of PD, 15% (95% CI -0.2 to 33, P = 0.09) phenotypic variance associated with early-onset PD and 31% (95% CI 17-44, P = 1.34 × 10⁻⁵) phenotypic variance associated with late-onset PD. This is a substantial increase from the genetic variance identified by top GWAS hits alone (between 3 and 5%) and indicates there are substantially more risk loci to be identified. Our results suggest that although GWASs are a useful tool in identifying the most common variants associated with complex disease, many common variants of small effect remain to be discovered.
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
To gain further insight into the genetic architecture of psoriasis, we conducted a meta-analysis of 3 genome-wide association studies (GWAS) and 2 independent data sets genotyped on the Immunochip, including 10,588 cases and 22,806 controls. We identified 15 new susceptibility loci, increasing to 36 the number associated with psoriasis in European individuals. We also identified, using conditional analyses, five independent signals within previously known loci. The newly identified loci shared with other autoimmune diseases include candidate genes with roles in regulating T-cell function (such as RUNX3, TAGAP and STAT3). Notably, they included candidate genes whose products are involved in innate host defense, including interferon-mediated antiviral responses (DDX58), macrophage activation (ZC3H12C) and nuclear factor (NF)-κB signaling (CARD14 and CARM1). These results portend a better understanding of shared and distinctive genetic determinants of immune-mediated inflammatory disorders and emphasize the importance of the skin in innate and acquired host defense. © 2012 Nature America, Inc. All rights reserved.
To further investigate susceptibility loci identified by genome-wide association studies, we genotyped 5,500 SNPs across 14 associated regions in 8,000 samples from a control group and 3 diseases: type 2 diabetes (T2D), coronary artery disease (CAD) and Graves' disease. We defined, using Bayes theorem, credible sets of SNPs that were 95% likely, based on posterior probability, to contain the causal disease-associated SNPs. In 3 of the 14 regions, TCF7L2 (T2D), CTLA4 (Graves' disease) and CDKN2A-CDKN2B (T2D), much of the posterior probability rested on a single SNP, and, in 4 other regions (CDKN2A-CDKN2B (CAD) and CDKAL1, FTO and HHEX (T2D)), the 95% sets were small, thereby excluding most SNPs as potentially causal. Very few SNPs in our credible sets had annotated functions, illustrating the limitations in understanding the mechanisms underlying susceptibility to common diseases. Our results also show the value of more detailed mapping to target sequences for functional studies. © 2012 Nature America, Inc. All rights reserved.
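The credible-set construction described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' pipeline: it assumes exactly one causal SNP per region and equal prior probability for every SNP, so that each SNP's posterior is its Bayes factor divided by the regional sum. The function names and the Bayes factor values are hypothetical.

```python
def posterior_from_bayes_factors(bayes_factors):
    """Convert per-SNP Bayes factors to posterior probabilities.

    Assumption (not from the abstract): one causal SNP per region and a
    flat prior, so posterior_i = BF_i / sum of all BFs in the region.
    """
    total = sum(bayes_factors.values())
    return {snp: bf / total for snp, bf in bayes_factors.items()}


def credible_set(posteriors, level=0.95):
    """Return the smallest set of SNPs whose posteriors sum to >= level."""
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for snp, prob in ranked:
        selected.append(snp)
        cumulative += prob
        if cumulative >= level:
            break
    return selected, cumulative


# Hypothetical Bayes factors for four SNPs in one region.
example_bfs = {"snp_a": 90.0, "snp_b": 6.0, "snp_c": 3.0, "snp_d": 1.0}
posteriors = posterior_from_bayes_factors(example_bfs)
snps, mass = credible_set(posteriors)
print(snps, round(mass, 3))
```

When one SNP carries most of the posterior probability, as reported for TCF7L2 and CTLA4, the set returned by such a procedure contains very few SNPs.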
We have performed a metabolite quantitative trait locus (mQTL) study of the ¹H nuclear magnetic resonance spectroscopy (¹H NMR) metabolome in humans, building on recent targeted knowledge of genetic drivers of metabolic regulation. Urine and plasma samples were collected from two cohorts of individuals of European descent, with one cohort comprised of female twins donating samples longitudinally. Sample metabolite concentrations were quantified by ¹H NMR and tested for association with genome-wide single-nucleotide polymorphisms (SNPs). Four metabolites' concentrations exhibited significant, replicable association with SNP variation (2.8 × 10⁻²³ < p < 8.6 × 10⁻¹¹). Three of these (trimethylamine, 3-amino-isobutyrate, and an N-acetylated compound) were measured in urine. The other (dimethylamine) was measured in plasma. Trimethylamine and dimethylamine mapped to a single genetic region (hence we report a total of three implicated genomic regions). Two of the three hit regions lie within haplotype blocks (at 2p13.1 and 10q24.2) that carry the genetic signature of strong, recent, positive selection in European populations. Genes NAT8 and PYROXD2, both with relatively uncharacterized functional roles, are good candidates for mediating the corresponding mQTL associations. The study's longitudinal twin design allowed detailed variance-components analysis of the sources of population variation in metabolite levels. The mQTLs explained 40%-64% of biological population variation in the corresponding metabolites' concentrations. These effect sizes are stronger than those reported in a recent, targeted mQTL study of metabolites in serum using the targeted-metabolomics Biocrates platform.
By re-analysing our plasma samples using the Biocrates platform, we replicated the mQTL findings of the previous study and discovered a previously uncharacterized yet substantial familial component of variation in metabolite levels in addition to the heritability contribution from the corresponding mQTL effects.
Accurate assignment of copy number at known copy number variant (CNV) loci is important both for increasing understanding of the structural evolution of genomes and for carrying out association studies of copy number with disease. As with calling SNP genotypes, the task can be framed as a clustering problem, but for a number of reasons assigning copy number is much more challenging. CNV assays have lower signal-to-noise ratios than SNP assays, often display heavy-tailed and asymmetric intensity distributions, contain outlying observations and may exhibit systematic technical differences among different cohorts. In addition, the number of copy-number classes at a CNV in the population may be unknown a priori. Due to these complications, automatic and robust assignment of copy number from array data remains a challenging problem. We have developed a copy number assignment algorithm, CNVCALL, for a targeted CNV array, such as that used by the Wellcome Trust Case Control Consortium's recent CNV association study. We use a Bayesian hierarchical mixture model that robustly identifies both the number of different copy number classes at a specific locus and the relative copy number for each individual in the sample. This approach is fully automated, which is a critical requirement when analyzing large numbers of CNVs. We illustrate the method's performance using real data from the Wellcome Trust Case Control Consortium's CNV association study and using simulated data.
Salmonella enterica is a bacterial pathogen that causes enteric fever and gastroenteritis in humans and animals. Although its population structure was long described as clonal, based on high linkage disequilibrium between loci typed by enzyme electrophoresis, recent examination of gene sequences has revealed that recombination plays an important evolutionary role. We sequenced around 10% of the core genome of 114 isolates of enterica using a resequencing microarray. Application of two different analysis methods (Structure and ClonalFrame) to our genomic data allowed us to define five clear lineages within S. enterica subspecies enterica, one of which is five times older than the other four and two thirds of the age of the whole subspecies. We show that some of these lineages display more evidence of recombination than others. We also demonstrate that some level of sexual isolation exists between the lineages, so that recombination has occurred predominantly between members of the same lineage. This pattern of recombination is compatible with expectations from the previously described ecological structuring of the enterica population as well as mechanistic barriers to recombination observed in laboratory experiments. In spite of their relatively low level of genetic differentiation, these lineages might therefore represent incipient species.
MOTIVATION: Performing experiments with simulated data is an inexpensive approach to evaluating competing experimental designs and analysis methods in genome-wide association studies. Simulation based on resampling known haplotypes is fast and efficient and can produce samples with patterns of linkage disequilibrium (LD), which mimic those in real data. However, the inability of current methods to simulate multiple nearby disease SNPs on the same chromosome can limit their application. RESULTS: We introduce a new simulation algorithm based on a successful resampling method, HAPGEN, that can simulate multiple nearby disease SNPs on the same chromosome. The new method, HAPGEN2, retains many advantages of resampling methods and expands the range of disease models that current simulators offer. AVAILABILITY: HAPGEN2 is freely available from http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html. CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
PLOS GENETICS, 7 (6), pp. e1002142-e1002142. | Citations: 21 (Web of Science Lite) | 2011. A Two-Stage Meta-Analysis Identifies Several New Loci for Parkinson's Disease
Evaluating the likelihood function of parameters in highly-structured population genetic models from extant deoxyribonucleic acid (DNA) sequences is computationally prohibitive. In such cases, one may approximately infer the parameters from summary statistics of the data such as the site-frequency-spectrum (SFS) or its linear combinations. Such methods are known as approximate likelihood or Bayesian computations. Using a controlled lumped Markov chain and computational commutative algebraic methods, we compute the exact likelihood of the SFS and many classical linear combinations of it at a non-recombining locus that is neutrally evolving under the infinitely-many-sites mutation model. Using a partially ordered graph of coalescent experiments around the SFS, we provide a decision-theoretic framework for approximate sufficiency. We also extend a family of classical hypothesis tests of standard neutrality at a non-recombining locus based on the SFS to a more powerful version that conditions on the topological information provided by the SFS.
Most findings from genome-wide association studies (GWAS) are consistent with a simple disease model at a single nucleotide polymorphism, in which each additional copy of the risk allele increases risk by the same multiplicative factor, in contrast to dominance or interaction effects. As others have noted, departures from this multiplicative model are difficult to detect. Here, we seek to quantify this both analytically and empirically. We show that imperfect linkage disequilibrium (LD) between causal and marker loci distorts disease models, with the power to detect such departures dropping off very quickly: decaying as a function of r⁴, where r² is the usual correlation between the causal and marker loci, in contrast to the well-known result that power to detect a multiplicative effect decays as a function of r². We perform a simulation study with empirical patterns of LD to assess how this disease model distortion is likely to impact GWAS results. Among loci where association is detected, we observe that there is reasonable power to detect substantial deviations from the multiplicative model, such as for dominant and recessive models. Thus, it is worth explicitly testing for such deviations routinely.
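The practical consequence of the r² versus r⁴ scaling can be illustrated with a short Python sketch. It rests on a simplifying assumption that is not stated in the abstract: that the non-centrality parameter of each test is proportional to sample size and to r² (multiplicative effect) or r⁴ (departure from the multiplicative model), so that the sample size needed to match the power available at the causal locus inflates by 1/r² or 1/r⁴ respectively. The function name is hypothetical.

```python
def relative_sample_size(r_squared, departure=False):
    """Sample-size inflation needed at a marker relative to the causal locus.

    Assumes (as a simplification) that the test's non-centrality parameter
    scales as N * r^2 for a multiplicative effect and as N * r^4 for a
    departure from the multiplicative model (e.g. dominance).
    """
    if not 0.0 < r_squared <= 1.0:
        raise ValueError("r_squared must lie in (0, 1]")
    exponent = 2 if departure else 1
    return 1.0 / r_squared ** exponent


# Compare the two inflation factors at a few levels of LD.
for r2 in (1.0, 0.8, 0.5):
    print(r2, relative_sample_size(r2), relative_sample_size(r2, departure=True))
```

Under this assumption, a marker with r² = 0.5 to the causal variant needs twice the sample size to detect the multiplicative effect but four times the sample size to detect a dominance departure.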
Genome-wide association studies (GWAS) have identified hundreds of associated loci across many common diseases. Most risk variants identified by GWAS will merely be tags for as-yet-unknown causal variants. It is therefore possible that identification of the causal variant, by fine mapping, will identify alleles with larger effects on genetic risk than those currently estimated from GWAS replication studies. We show that under plausible assumptions, whilst the majority of the per-allele relative risks (RR) estimated from GWAS data will be close to the true risk at the causal variant, some could be considerable underestimates. For example, for an estimated RR in the range 1.2-1.3, there is approximately a 38% chance that it exceeds 1.4 and a 10% chance that it is over 2. We show how these probabilities can vary depending on the true effects associated with low-frequency variants and on the minor allele frequency (MAF) of the most associated SNP. We investigate the consequences of the underestimation of effect sizes for predictions of an individual's disease risk and interpret our results for the design of fine mapping experiments. Although these effects mean that the amount of heritability explained by known GWAS loci is expected to be larger than current projections, this increase is likely to explain a relatively small amount of the so-called "missing" heritability.
Science, 331 (6020), pp. 1024-1025. | Citations: 4 (Scopus) | 2011. Genome-sequencing anniversary. Making sense of the data.
SCIENCE, 331 (6020), pp. 1024-1025. | Citations: 4 (Web of Science Lite) | 2011. Making Sense of the Data
We performed a genome-wide association study (GWAS) in 1705 Parkinson's disease (PD) UK patients and 5175 UK controls, the largest sample size so far for a PD GWAS. Replication was attempted in an additional cohort of 1039 French PD cases and 1984 controls for the 27 regions showing the strongest evidence of association (P < 10⁻⁴). We replicated published associations in the 4q22/SNCA and 17q21/MAPT chromosome regions (P < 10⁻¹⁰) and found evidence for an additional independent association in 4q22/SNCA. A detailed analysis of the haplotype structure at 17q21 showed that there are three separate risk groups within this region. We found weak but consistent evidence of association for common variants located in three previously published associated regions (4p15/BST1, 4p16/GAK and 1q32/PARK16). We found no support for the previously reported SNP association in 12q12/LRRK2. We also found an association of the two SNPs in 4q22/SNCA with the age of onset of the disease.
Metformin is the most commonly used pharmacological therapy for type 2 diabetes. We report a genome-wide association study for glycemic response to metformin in 1,024 Scottish individuals with type 2 diabetes with replication in two cohorts including 1,783 Scottish individuals and 1,113 individuals from the UK Prospective Diabetes Study. In a combined meta-analysis, we identified a SNP, rs11212617, associated with treatment success (n = 3,920, P = 2.9 × 10⁻⁹, odds ratio = 1.35, 95% CI 1.22-1.49) at a locus containing ATM, the ataxia telangiectasia mutated gene. In a rat hepatoma cell line, inhibition of ATM with KU-55933 attenuated the phosphorylation and activation of AMP-activated protein kinase in response to metformin. We conclude that ATM, a gene known to be involved in DNA repair and cell cycle control, plays a role in the effect of metformin upstream of AMP-activated protein kinase, and variation in this gene alters glycemic response to metformin.
Genome-wide association studies have identified 11 common variants convincingly associated with coronary artery disease (CAD)¹⁻⁷, a modest number considering the apparent heritability of CAD⁸. All of these variants have been discovered in European populations. We report a meta-analysis of four large genome-wide association studies of CAD, with ∼575,000 genotyped SNPs in a discovery dataset comprising 15,420 individuals with CAD (cases) (8,424 Europeans and 6,996 South Asians) and 15,062 controls. There was little evidence for ancestry-specific associations, supporting the use of combined analyses. Replication in an independent sample of 21,408 cases and 19,185 controls identified five loci newly associated with CAD (P < 5 × 10⁻⁸ in the combined discovery and replication analysis): LIPA on 10q23, PDGFD on 11q22, ADAMTS7-MORF4L1 on 15q25, a gene rich locus on 7q22 and KIAA1462 on 10p11. The CAD-associated SNP in the PDGFD locus showed tissue-specific cis expression quantitative trait locus effects. These findings implicate new pathways for CAD susceptibility.
Ankylosing spondylitis is a common form of inflammatory arthritis predominantly affecting the spine and pelvis that occurs in approximately 5 out of 1,000 adults of European descent. Here we report the identification of three variants in the RUNX3, LTBR-TNFRSF1A and IL12B regions convincingly associated with ankylosing spondylitis (P < 5 × 10⁻⁸ in the combined discovery and replication datasets) and a further four loci at PTGER4, TBKBP1, ANTXR2 and CARD9 that show strong association across all our datasets (P < 5 × 10⁻⁶ overall, with support in each of the three datasets studied). We also show that polymorphisms of ERAP1, which encodes an endoplasmic reticulum aminopeptidase involved in peptide trimming before HLA class I presentation, only affect ankylosing spondylitis risk in HLA-B27-positive individuals. These findings provide strong evidence that HLA-B27 operates in ankylosing spondylitis through a mechanism involving aberrant processing of antigenic peptides.
¹H Nuclear Magnetic Resonance spectroscopy (¹H NMR) is increasingly used to measure metabolite concentrations in sets of biological samples for top-down systems biology and molecular epidemiology. For such purposes, knowledge of the sources of human variation in metabolite concentrations is valuable, but currently sparse. We conducted and analysed a study to create such a resource. In our unique design, identical and non-identical twin pairs donated plasma and urine samples longitudinally. We acquired ¹H NMR spectra on the samples, and statistically decomposed variation in metabolite concentration into familial (genetic and common-environmental), individual-environmental, and longitudinally unstable components. We estimate that stable variation, comprising familial and individual-environmental factors, accounts on average for 60% (plasma) and 47% (urine) of biological variation in ¹H NMR-detectable metabolite concentrations. Clinically predictive metabolic variation is likely nested within this stable component, so our results have implications for the effective design of biomarker-discovery studies. We provide a power-calculation method which reveals that sample sizes of a few thousand should offer sufficient statistical precision to detect ¹H NMR-based biomarkers quantifying predisposition to disease.
Multiple sclerosis is a common disease of the central nervous system in which the interplay between inflammatory and neurodegenerative processes typically results in intermittent neurological disturbance followed by progressive accumulation of disability. Epidemiological studies have shown that genetic factors are primarily responsible for the substantially increased frequency of the disease seen in the relatives of affected individuals, and systematic attempts to identify linkage in multiplex families have confirmed that variation within the major histocompatibility complex (MHC) exerts the greatest individual effect on risk. Modestly powered genome-wide association studies (GWAS) have enabled more than 20 additional risk loci to be identified and have shown that multiple variants exerting modest individual effects have a key role in disease susceptibility. Most of the genetic architecture underlying susceptibility to the disease remains to be defined and is anticipated to require the analysis of sample sizes that are beyond the numbers currently available to individual research groups. In a collaborative GWAS involving 9,772 cases of European descent collected by 23 research groups working in 15 different countries, we have replicated almost all of the previously suggested associations and identified at least a further 29 novel susceptibility loci. Within the MHC we have refined the identity of the HLA-DRB1 risk alleles and confirmed that variation in the HLA-A gene underlies the independent protective effect attributable to the class I region. Immunologically relevant genes are significantly overrepresented among those mapping close to the identified loci and particularly implicate T-helper-cell differentiation in the pathogenesis of multiple sclerosis.
Accurate assignment of copy number at known copy number variant (CNV) loci is important both for increasing understanding of the structural evolution of genomes and for carrying out association studies of copy number with disease. As with calling SNP genotypes, the task can be framed as a clustering problem, but for a number of reasons assigning copy number is much more challenging. CNV assays have lower signal-to-noise ratios than SNP assays, often display heavy-tailed and asymmetric intensity distributions, contain outlying observations and may exhibit systematic technical differences among different cohorts. In addition, the number of copy-number classes at a CNV in the population may be unknown a priori. Due to these complications, automatic and robust assignment of copy number from array data remains a challenging problem. We have developed a copy number assignment algorithm, CNVCALL, for a targeted CNV array, such as that used by the Wellcome Trust Case Control Consortium's recent CNV association study. We use a Bayesian hierarchical mixture model that robustly identifies both the number of different copy number classes at a specific locus and the relative copy number for each individual in the sample. This approach is fully automated, which is a critical requirement when analyzing large numbers of CNVs. We illustrate the method's performance using real data from the Wellcome Trust Case Control Consortium's CNV association study and using simulated data. © 2011 Wiley-Liss, Inc.
OBJECTIVE: Proinsulin is a precursor of mature insulin and C-peptide. Higher circulating proinsulin levels are associated with impaired β-cell function, raised glucose levels, insulin resistance, and type 2 diabetes (T2D). Studies of the insulin processing pathway could provide new insights about T2D pathophysiology. RESEARCH DESIGN AND METHODS: We have conducted a meta-analysis of genome-wide association tests of ∼2.5 million genotyped or imputed single nucleotide polymorphisms (SNPs) and fasting proinsulin levels in 10,701 nondiabetic adults of European ancestry, with follow-up of 23 loci in up to 16,378 individuals, using additive genetic models adjusted for age, sex, fasting insulin, and study-specific covariates. RESULTS: Nine SNPs at eight loci were associated with proinsulin levels (P < 5 × 10⁻⁸). Two loci (LARP6 and SGSM2) have not been previously related to metabolic traits, one (MADD) has been associated with fasting glucose, one (PCSK1) has been implicated in obesity, and four (TCF7L2, SLC30A8, VPS13C/C2CD4A/B, and ARAP1, formerly CENTD2) increase T2D risk. The proinsulin-raising allele of ARAP1 was associated with a lower fasting glucose (P = 1.7 × 10⁻⁴), improved β-cell function (P = 1.1 × 10⁻⁵), and lower risk of T2D (odds ratio 0.88; P = 7.8 × 10⁻⁶). Notably, PCSK1 encodes the protein prohormone convertase 1/3, the first enzyme in the insulin processing pathway. A genotype score composed of the nine proinsulin-raising alleles was not associated with coronary disease in two large case-control datasets. CONCLUSIONS: We have identified nine genetic variants associated with fasting proinsulin. Our findings illuminate the biology underlying glucose homeostasis and T2D development in humans and argue against a direct role of proinsulin in coronary artery disease pathogenesis.
We examined the role of common genetic variation in schizophrenia in a genome-wide association study of substantial size: a stage 1 discovery sample of 21,856 individuals of European ancestry and a stage 2 replication sample of 29,839 independent subjects. The combined stage 1 and 2 analysis yielded genome-wide significant associations with schizophrenia for seven loci, five of which are new (1p21.3, 2q32.3, 8p23.2, 8q21.3 and 10q24.32-q24.33) and two of which have been previously implicated (6p21.32-p22.1 and 18q21.2). The strongest new finding (P = 1.6 × 10⁻¹¹) was with rs1625579 within an intron of a putative primary transcript for MIR137 (microRNA 137), a known regulator of neuronal development. Four other schizophrenia loci achieving genome-wide significance contain predicted targets of MIR137, suggesting MIR137-mediated dysregulation as a previously unknown etiologic mechanism in schizophrenia. In a joint analysis with a bipolar disorder sample (16,374 affected individuals and 14,044 controls), three loci reached genome-wide significance: CACNA1C (rs4765905, P = 7.0 × 10⁻⁹), ANK3 (rs10994359, P = 2.5 × 10⁻⁸) and the ITIH3-ITIH4 region (rs2239547, P = 7.8 × 10⁻⁹).
A rise in [Ca²⁺]ᵢ provides the trigger for neurotransmitter release at neuronal boutons. We have used confocal microscopy and Ca²⁺-sensitive dyes to directly measure the action potential-evoked [Ca²⁺]ᵢ in the boutons of Schaffer collaterals. This reveals that the trial-by-trial amplitude of the evoked Ca²⁺ transient is bimodally distributed. We demonstrate that "large" Ca²⁺ transients occur when presynaptic NMDA receptors are activated following transmitter release. Presynaptic NMDA receptor activation proves critical in producing facilitation of transmission at theta frequencies. Because large Ca²⁺ transients "report" transmitter release, their frequency on a trial-by-trial basis can be used to estimate the probability of release, pᵣ. We use this novel estimator to show that pᵣ increases following the induction of long-term potentiation.
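The release-probability estimator described above amounts to counting the fraction of trials that produce a "large" transient. A minimal sketch follows, assuming a fixed amplitude threshold separates the two modes; the threshold and the trial amplitudes are hypothetical and are not taken from the study.

```python
def estimate_release_probability(amplitudes, threshold):
    """Fraction of trials whose evoked transient exceeds `threshold`.

    `threshold` is a hypothetical cut-off between the two modes of the
    bimodal amplitude distribution; the source does not give one."""
    if not amplitudes:
        raise ValueError("need at least one trial")
    large = sum(1 for a in amplitudes if a > threshold)
    return large / len(amplitudes)

# Illustrative trial-by-trial amplitudes (arbitrary units, made up)
trials = [0.21, 0.19, 0.58, 0.22, 0.61, 0.20, 0.18, 0.55, 0.23, 0.20]
p_r = estimate_release_probability(trials, threshold=0.4)
print(p_r)  # 3 of 10 trials are "large" -> 0.3
```

In practice the cut-off would more plausibly come from fitting the two modes rather than from a hand-picked value.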
To identify new susceptibility loci for psoriasis, we undertook a genome-wide association study of 594,224 SNPs in 2,622 individuals with psoriasis and 5,667 controls. We identified associations at eight previously unreported genomic loci. Seven loci harbored genes with recognized immune functions (IL28RA, REL, IFIH1, ERAP1, TRAF3IP2, NFKBIA and TYK2). These associations were replicated in 9,079 European samples (six loci with a combined P < 5 × 10⁻⁸ and two loci with a combined P < 5 × 10⁻⁷). We also report compelling evidence for an interaction between the HLA-C and ERAP1 loci (combined P = 6.95 × 10⁻⁶). ERAP1 plays an important role in MHC class I peptide processing. ERAP1 variants only influenced psoriasis susceptibility in individuals carrying the HLA-C risk allele. Our findings implicate pathways that integrate epidermal barrier dysfunction with innate and adaptive immune dysregulation in psoriasis pathogenesis.
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
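To put the de novo mutation rate in context, a back-of-the-envelope calculation converts the per-base rate into an expected number of new substitutions per child. The haploid genome size of about 3.1 × 10⁹ base pairs used below is an assumption that does not come from the abstract, and the accessible portion of the genome is smaller.

```python
def expected_de_novo_mutations(rate_per_bp, haploid_genome_bp=3.1e9, ploidy=2):
    """Expected de novo substitutions per generation in one individual.

    `haploid_genome_bp` is an assumed round figure, not a value from the
    abstract; both inherited genome copies can carry new mutations."""
    return rate_per_bp * haploid_genome_bp * ploidy

count = expected_de_novo_mutations(1e-8)
print(f"about {count:.0f} de novo substitutions per child")
```

Under these assumptions the rate corresponds to a few tens of new base substitutions per generation.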
Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.
PLOS GENETICS, 6 (4). | 2010. Is Mate Choice in Humans MHC-Dependent?
Although present in both humans and chimpanzees, recombination hotspots, at which meiotic crossover events cluster, differ markedly in their genomic location between the species. We report that a 13-base pair sequence motif previously associated with the activity of 40% of human hotspots does not function in chimpanzees and is being removed by self-destructive drive in the human lineage. Multiple lines of evidence suggest that the rapidly evolving zinc-finger protein PRDM9 binds to this motif and that sequence changes in the protein may be responsible for hotspot differences between species. The involvement of PRDM9, which causes histone H3 lysine 4 trimethylation, implies that there is a common mechanism for recombination hotspots in eukaryotes but raises questions about what forces have driven such rapid change.
Despite compelling evidence for a major genetic contribution to risk of bipolar mood disorder, conclusive evidence implicating specific genes or pathophysiological systems has proved elusive. In part this is likely to be related to the unknown validity of current phenotype definitions and consequent aetiological heterogeneity of samples. In the recent Wellcome Trust Case Control Consortium genome-wide association analysis of bipolar disorder (1868 cases, 2938 controls) one of the most strongly associated polymorphisms lay within the gene encoding the GABA(A) receptor beta1 subunit, GABRB1. Aiming to increase biological homogeneity, we sought the diagnostic subset that showed the strongest signal at this polymorphism and used this to test for independent evidence of association with other members of the GABA(A) receptor gene family. The index signal was significantly enriched in the 279 cases meeting Research Diagnostic Criteria for schizoaffective disorder, bipolar type (P = 3.8 × 10⁻⁶). Independently, these cases showed strong evidence that variation in GABA(A) receptor genes influences risk for this phenotype (independent system-wide P = 6.6 × 10⁻⁵) with association signals also at GABRA4, GABRB3, GABRA5 and GABRR3. Our findings have the potential to inform understanding of presentation, pathogenesis and nosology of bipolar disorders. Our method of phenotype refinement may be useful in studies of other complex psychiatric and non-psychiatric disorders.
Bulletin of Mathematical Biology, pp. 1-44. | 2010. Experiments with the Site Frequency Spectrum
NATURE GENETICS, 43 (2), pp. 117-U57. | 2011. Common variants near ATM are associated with glycemic response to metformin in type 2 diabetes
Ulcerative colitis is a common form of inflammatory bowel disease with a complex etiology. As part of the Wellcome Trust Case Control Consortium 2, we performed a genome-wide association scan for ulcerative colitis in 2,361 cases and 5,417 controls. Loci showing evidence of association at P < 1 × 10⁻⁵ were followed up by genotyping in an independent set of 2,321 cases and 4,818 controls. We find genome-wide significant evidence of association at three new loci, each containing at least one biologically relevant candidate gene, on chromosomes 20q13 (HNF4A; P = 3.2 × 10⁻¹⁷), 16q22 (CDH1 and CDH3; P = 2.8 × 10⁻⁸) and 7q31 (LAMB1; P = 3.0 × 10⁻⁸). Of note, CDH1 has recently been associated with susceptibility to colorectal cancer, an established complication of longstanding ulcerative colitis. The new associations suggest that changes in the integrity of the intestinal epithelial barrier may contribute to the pathogenesis of ulcerative colitis.
The standard paradigm for the analysis of genome-wide association studies involves carrying out association tests at both typed and imputed SNPs. These methods will not be optimal for detecting the signal of association at SNPs that are not currently known or in regions where allelic heterogeneity occurs. We propose a novel association test, complementary to the SNP-based approaches, that attempts to extract further signals of association by explicitly modeling and estimating both unknown SNPs and allelic heterogeneity at a locus. At each site we estimate the genealogy of the case-control sample by taking advantage of the HapMap haplotypes across the genome. Allelic heterogeneity is modeled by allowing more than one mutation on the branches of the genealogy. Our use of Bayesian methods allows us to assess directly the evidence for a causative SNP not well correlated with known SNPs and for allelic heterogeneity at each locus. Using simulated data and real data from the WTCCC project, we show that our method (i) produces a significant boost in signal and accurately identifies the form of the allelic heterogeneity in regions where it is known to exist, (ii) can suggest new signals that are not found by testing typed or imputed SNPs and (iii) can provide more accurate estimates of effect sizes in regions of association.
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.
We report a genome-wide association (GWA) study of severe malaria in The Gambia. The initial GWA scan included 2,500 children genotyped on the Affymetrix 500K GeneChip, and a replication study included 3,400 children. We used this to examine the performance of GWA methods in Africa. We found considerable population stratification, and also that signals of association at known malaria resistance loci were greatly attenuated owing to weak linkage disequilibrium (LD). To investigate possible solutions to the problem of low LD, we focused on the HbS locus, sequencing this region of the genome in 62 Gambian individuals and then using these data to conduct multipoint imputation in the GWA samples. This increased the signal of association, from P = 4 × 10⁻⁷ to P = 4 × 10⁻¹⁴, with the peak of the signal located precisely at the HbS causal variant. Our findings provide proof of principle that fine-resolution multipoint imputation, based on population-specific sequencing data, can substantially boost authentic GWA signals and enable fine mapping of causal variants in African populations.
Genome-wide association studies are revolutionizing the search for the genes underlying human complex diseases. The main decisions to be made at the design stage of these studies are the choice of the commercial genotyping chip to be used and the numbers of case and control samples to be genotyped. The most common method of comparing different chips is using a measure of coverage, but this fails to properly account for the effects of sample size, the genetic model of the disease, and linkage disequilibrium between SNPs. In this paper, we argue that the statistical power to detect a causative variant should be the major criterion in study design. Because of the complicated pattern of linkage disequilibrium (LD) in the human genome, power cannot be calculated analytically and must instead be assessed by simulation. We describe in detail a method of simulating case-control samples at a set of linked SNPs that replicates the patterns of LD in human populations, and we used it to assess power for a comprehensive set of available genotyping chips. Our results allow us to compare the performance of the chips to detect variants with different effect sizes and allele frequencies, look at how power changes with sample size in different populations or when using multi-marker tags and genotype imputation approaches, and how performance compares to a hypothetical chip that contains every SNP in HapMap. A main conclusion of this study is that marked differences in genome coverage may not translate into appreciable differences in power and that, when taking budgetary considerations into account, the most powerful design may not always correspond to the chip with the highest coverage. We also show that genotype imputation can be used to boost the power of many chips up to the level obtained from a hypothetical "complete" chip containing all the SNPs in HapMap. 
Our results have been encapsulated into an R software package that allows users to design future association studies and our methods provide a framework with which new chip sets can be evaluated.
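The idea of assessing power by simulation can be illustrated with a deliberately simplified sketch: a single directly typed causal SNP, a multiplicative per-allele odds ratio, and an allelic chi-square test. This ignores linkage disequilibrium and indirect tagging, which are the hard parts the abstract describes, and it is not the authors' R package; all function names and parameter values are assumptions made for illustration.

```python
import math
import random

def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases, assuming a per-allele odds ratio
    relative to the control frequency (rare-disease approximation)."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)

def allelic_chi2(case_alt, control_alt, n_case, n_control):
    """1-df chi-square statistic for a 2x2 table of allele counts."""
    total = n_case + n_control
    alt = case_alt + control_alt
    ref = total - alt
    if alt == 0 or ref == 0:
        return 0.0
    cross = case_alt * (n_control - control_alt) - control_alt * (n_case - case_alt)
    return total * cross ** 2 / (n_case * n_control * alt * ref)

def simulate_power(n_cases, n_controls, control_freq, odds_ratio,
                   alpha=5e-7, n_sims=100, seed=1):
    """Fraction of simulated studies with allelic-test P below alpha."""
    rng = random.Random(seed)
    f_case = case_allele_freq(control_freq, odds_ratio)
    n_case, n_control = 2 * n_cases, 2 * n_controls  # alleles, not people
    hits = 0
    for _ in range(n_sims):
        case_alt = sum(rng.random() < f_case for _ in range(n_case))
        control_alt = sum(rng.random() < control_freq for _ in range(n_control))
        chi2 = allelic_chi2(case_alt, control_alt, n_case, n_control)
        p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square (1 df) tail
        hits += p_value < alpha
    return hits / n_sims

print(simulate_power(1000, 1000, control_freq=0.3, odds_ratio=1.5))
```

Varying `n_cases`, `odds_ratio` and `control_freq` in this toy setting shows how power depends on sample size, effect size and allele frequency; a chip comparison would additionally need realistic LD between the causal variant and the typed SNPs.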
After more than a decade of hope and hype, researchers are finally making inroads into understanding the genetic basis of many common human diseases. The use of genome-wide association studies has broken the logjam, enabling genetic variants at specific loci to be associated with particular diseases. Genetic association data are now providing new routes to understanding the aetiology of disease, as well as new footholds on the long and difficult path to better treatment and disease prevention.
In humans, most meiotic crossover events are clustered into short regions of the genome known as recombination hot spots. We have previously identified DNA motifs that are enriched in hot spots, particularly the 7-mer CCTCCCT. Here we use the increased hot-spot resolution afforded by the Phase 2 HapMap and novel search methods to identify an extended family of motifs based around the degenerate 13-mer CCNCCNTNNCCNC, which is critical in recruiting crossover events to at least 40% of all human hot spots and which operates on diverse genetic backgrounds in both sexes. Furthermore, these motifs are found in hypervariable minisatellites and are clustered in the breakpoint regions of both disease-causing nonallelic homologous recombination hot spots and common mitochondrial deletion hot spots, implicating the motif as a driver of genome instability.
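Scanning a sequence for the degenerate 13-mer can be sketched with a regular expression in which each N matches any base. The example below searches both strands and reports overlapping hits; the test sequence is made up, and the helper names are illustrative rather than taken from the study's search methods.

```python
import re

MOTIF = "CCNCCNTNNCCNC"  # degenerate 13-mer quoted in the abstract
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]

def motif_to_regex(motif):
    """Compile a motif into a regex; N matches any single base.
    The lookahead lets overlapping occurrences all be reported."""
    body = "".join("[ACGT]" if base == "N" else base for base in motif)
    return re.compile(f"(?=({body}))")

def find_motif(sequence, motif=MOTIF):
    """Return sorted (start, strand, matched_text) tuples on both strands.
    Minus-strand hits are reported in forward-strand coordinates."""
    sequence = sequence.upper()
    patterns = (("+", motif_to_regex(motif)),
                ("-", motif_to_regex(reverse_complement(motif))))
    hits = []
    for strand, pattern in patterns:
        hits.extend((m.start(), strand, m.group(1))
                    for m in pattern.finditer(sequence))
    return sorted(hits)

# Made-up sequence containing one instance (it embeds the 7-mer CCTCCCT)
seq = "AAACCTCCCTAACCACTTT"
print(find_motif(seq))  # [(3, '+', 'CCTCCCTAACCAC')]
```

Finding a match says nothing by itself about hot-spot activity; enrichment would have to be assessed against matched background sequence.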
Genetic variation at classical HLA alleles is a crucial determinant of transplant success and susceptibility to a large number of infectious and autoimmune diseases. However, large-scale studies involving classical type I and type II HLA alleles might be limited by the cost of allele-typing technologies. Although recent studies have shown that some common HLA alleles can be tagged with small numbers of markers, SNP-based tagging does not offer a complete solution to predicting HLA alleles. We have developed a new statistical methodology to use SNP variation within the region to predict alleles at key class I (HLA-A, HLA-B, and HLA-C) and class II (HLA-DRB1, HLA-DQA1, and HLA-DQB1) loci. Our results indicate that a single panel of approximately 100 SNPs typed across the region is sufficient for predicting both rare and common HLA alleles with up to 95% accuracy in both African and non-African populations. Furthermore, we show that HLA alleles can be successfully predicted by using previously genotyped SNPs that are within the MHC and that had not been chosen for their ability to predict HLA alleles, such as those included on genome-wide products. These results indicate that our methodology, combined with an extended database of reference haplotypes, will facilitate large-scale experiments, including disease-association studies and vaccine trials, in which detailed information about HLA type is valuable.
In several species, including rodents and fish, it has been shown that the Major Histocompatibility Complex (MHC) influences mating preferences and, in some cases, that this may be mediated by preferences based on body odour. In humans, the picture has been less clear. Several studies have reported a tendency for humans to prefer MHC-dissimilar mates, a sexual selection that would favour the production of MHC-heterozygous offspring, who would be more resistant to pathogens, but these results are unsupported by other studies. Here, we report analyses of genome-wide genotype data (from the HapMap II dataset) and HLA types in African and European American couples to test whether humans tend to choose MHC-dissimilar mates. In order to distinguish MHC-specific effects from genome-wide effects, the pattern of similarity in the MHC region is compared to the pattern in the rest of the genome. African spouses show no significant pattern of similarity/dissimilarity across the MHC region (relatedness coefficient, R = 0.015, p = 0.23), whereas across the genome, they are more similar than random pairs of individuals (genome-wide R = 0.00185, p < 10⁻³). We discuss several explanations for these observations, including demographic effects. On the other hand, the sampled European American couples are significantly more MHC-dissimilar than random pairs of individuals (R = -0.043, p = 0.015), and this pattern of dissimilarity is extreme when compared to the rest of the genome, both globally (genome-wide R = -0.00016, p = 0.739) and when broken into windows having the same length and recombination rate as the MHC (only nine genomic regions exhibit a higher level of genetic dissimilarity between spouses than does the MHC). This study thus supports the hypothesis that the MHC influences mate choice in some human populations.
We have genotyped 14,436 nonsynonymous SNPs (nsSNPs) and 897 major histocompatibility complex (MHC) tag SNPs from 1,000 independent cases of ankylosing spondylitis (AS), autoimmune thyroid disease (AITD), multiple sclerosis (MS) and breast cancer (BC). Comparing these data against a common control dataset derived from 1,500 randomly selected healthy British individuals, we report initial association and independent replication in a North American sample of two new loci related to ankylosing spondylitis, ARTS1 and IL23R, and confirmation of the previously reported association of AITD with TSHR and FCRL3. These findings, enabled in part by increased statistical power resulting from the expansion of the control reference group to include individuals from the other disease groups, highlight notable new possibilities for autoimmune regulation and suggest that IL23R may be a common susceptibility factor for the major 'seronegative' diseases.
We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
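The "maximum r²" coverage measure can be made concrete with a small sketch: for an untyped SNP, compute the squared correlation with each typed SNP from phased haplotypes and keep the largest value. The haplotypes below are a made-up toy example with alleles coded 0/1, and the function names are illustrative assumptions.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between SNPs i and j across phased haplotypes
    (each haplotype is a sequence of 0/1 alleles)."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return d * d / denom

def max_r_squared(haplotypes, target, typed):
    """Best r^2 between an untyped target SNP and any typed SNP."""
    return max(r_squared(haplotypes, target, t) for t in typed)

# Toy data: SNP 0 is the untyped target; SNP 1 is a perfect proxy for it
haps = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0)]
print(max_r_squared(haps, target=0, typed=[1, 2]))  # 1.0
```

Averaging this quantity over all common untyped SNPs gives a coverage figure of the kind quoted above for a genotyping product.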
The Genetic Association Information Network (GAIN) is a public-private partnership established to investigate the genetic basis of common diseases through a series of collaborative genome-wide association studies. GAIN has used new approaches for project selection, data deposition and distribution, collaborative analysis, publication and protection from premature intellectual property claims. These demonstrate a new commitment to shared scientific knowledge that should facilitate rapid advances in understanding the genetics of complex diseases.
Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined approximately 2,000 individuals for each of 7 major diseases and a shared set of approximately 3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a large number of further signals (including 58 loci with single-point P values between 10⁻⁵ and 5 × 10⁻⁷) likely to yield additional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. This study thus represents a thorough validation of the GWA approach. It has also demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiology of these important disorders. We anticipate that our data, results and software, which will be widely available to other investigators, will provide a powerful resource for human genetics research.
Nature, 447 (7145), pp. 655-660. | Citations: 973 (Scopus) | 2007. Replicating genotype-phenotype associations.
Genome-wide association studies are still constrained by the cost of genotyping. For this reason, the selection of a reduced set of markers or tags able to capture a significant proportion of the genetic variation is an important aspect of these studies. Most tagging SNP selection methods have been successful in capturing the genetic variation of the data from which the tags have been chosen. However, when these tags are used in an independent data set, a significant proportion of the remaining SNPs (non-tags) are not captured and, in most cases, there is no information on which SNPs are captured. We propose to use a probabilistic model to predict the non-tags based on a set of tags, as a way to capture genetic variation. An important advantage of this method is that it directly predicts the genotype of the non-tags with which we can test for association with the phenotype and which could help to elucidate the location of genes responsible for increasing disease susceptibility. Additionally, this method provides an estimate of the probabilities with which the predictions are made, which reflects the confidence of the probabilistic model. We also propose new methods to select the tagging SNPs. We empirically show by using HapMap data that our approach is able to capture significantly more genetic variation than methods based solely on a pairwise LD measure.
With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2). We used 'long-range haplotype' methods, which were developed to identify alleles segregating in a population that have undergone recent selection, and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 non-synonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus, in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation, in Europe; and EDAR and EDA2R, both involved in development of hair follicles, in Asia.
In the case of Regina v. Adams, DNA evidence seemed to suggest that there was a 1 in 200 million chance that an innocent person would match the DNA found at the crime scene. Peter Donnelly explains how he subsequently became involved in the case and found himself trying to explain Bayes' theorem to judge and jury. (See p. 18 for an explanation.)
The major histocompatibility complex (MHC) on chromosome 6 is associated with susceptibility to more common diseases than any other region of the human genome, including almost all disorders classified as autoimmune. In type 1 diabetes the major genetic susceptibility determinants have been mapped to the MHC class II genes HLA-DQB1 and HLA-DRB1 (refs 1-3), but these genes cannot completely explain the association between type 1 diabetes and the MHC region. Owing to the region's extreme gene density, the multiplicity of disease-associated alleles, strong associations between alleles, limited genotyping capability, and inadequate statistical approaches and sample sizes, which, and how many, loci within the MHC determine susceptibility remains unclear. Here, in several large type 1 diabetes data sets, we analyse a combined total of 1,729 polymorphisms, and apply statistical methods (recursive partitioning and regression) to pinpoint disease susceptibility to the MHC class I genes HLA-B and HLA-A (risk ratios > 1.5; combined P = 2.01 × 10⁻¹⁹ and 2.35 × 10⁻¹³, respectively) in addition to the established associations of the MHC class II genes. Other loci with smaller and/or rarer effects might also be involved, but to find these, future searches must take into account both the HLA class II and class I genes and use even larger samples. Taken together with previous studies, we conclude that MHC-class-I-mediated events, principally involving HLA-B*39, contribute to the aetiology of type 1 diabetes.
In humans, the rate of recombination, as measured on the megabase scale, is positively associated with the level of genetic variation, as measured at the genic scale. Despite considerable debate, it is not clear whether these factors are causally linked or, if they are, whether this is driven by the repeated action of adaptive evolution or molecular processes such as double-strand break formation and mismatch repair. We introduce three innovations to the analysis of recombination and diversity: fine-scale genetic maps estimated from genotype experiments that identify recombination hotspots at the kilobase scale, analysis of an entire human chromosome, and the use of wavelet techniques to identify correlations acting at different scales. We show that recombination influences genetic diversity only at the level of recombination hotspots. Hotspots are also associated with local increases in GC content and the relative frequency of GC-increasing mutations but have no effect on substitution rates. Broad-scale association between recombination and diversity is explained through covariance of both factors with base composition. To our knowledge, these results are the first evidence of a direct and local influence of recombination hotspots on genetic variation and the fate of individual mutations. However, that hotspots have no influence on substitution rates suggests that they are too ephemeral on an evolutionary time scale to have a strong influence on broader scale patterns of base composition and long-term molecular evolution.
Using the statistical analysis of genetic variation, we have developed a high-resolution genetic map of recombination hotspots and recombination rate variation across the human genome. This map, which has a resolution several orders of magnitude greater than previous studies, identifies over 25,000 recombination hotspots and gives new insights into the distribution and determination of recombination. Wavelet-based analysis demonstrates scale-specific influences of base composition, coding context and DNA repeats on recombination rates, though, in contrast with other species, no association with DNase I hypersensitivity. We have also identified specific DNA motifs that are strongly associated with recombination hotspots and whose activity is influenced by local context. Comparative analysis of recombination rates in humans and chimpanzees demonstrates very high rates of evolution of the fine-scale structure of the recombination landscape. In the light of these observations, we suggest possible resolutions of the hotspot paradox.
Knowledge of haplotype phase is valuable for many analysis methods in the study of disease, population, and evolutionary genetics. Considerable research effort has been devoted to the development of statistical and computational methods that infer haplotype phase from genotype data. Although a substantial number of such methods have been developed, they have focused principally on inference from unrelated individuals, and comparisons between methods have been rather limited. Here, we describe the extension of five leading algorithms for phase inference for handling father-mother-child trios. We performed a comprehensive assessment of the methods applied to both trios and to unrelated individuals, with a focus on genomic-scale problems, using both simulated data and data from the HapMap project. The most accurate algorithm was PHASE (v2.1). For this method, the percentages of genotypes whose phase was incorrectly inferred were 0.12%, 0.05%, and 0.16% for trios from simulated data, HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trios, and HapMap Yoruban trios, respectively, and 5.2% and 5.9% for unrelated individuals in simulated data and the HapMap CEPH data, respectively. The other methods considered in this work had comparable but slightly worse error rates. The error rates for trios are similar to the levels of genotyping error and missing data expected. We thus conclude that all the methods considered will provide highly accurate estimates of haplotypes when applied to trio data sets. Running times differ substantially between methods. Although it is one of the slowest methods, PHASE (v2.1) was used to infer haplotypes for the 1 million-SNP HapMap data set. Finally, we evaluated methods of estimating the value of r² between a pair of SNPs and concluded that all methods estimated r² well when the estimated value was ≥ 0.8.
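As an illustration of the r² statistic mentioned at the end of this abstract, the following is a minimal sketch of the standard pairwise definition, r² = D² / (p_A(1 - p_A) p_B(1 - p_B)) with D = p_AB - p_A p_B, computed from already-phased haplotypes. It is not one of the estimation methods compared in the paper; the function name `r_squared` and the 0/1 allele coding are assumptions made for this example.

```python
def r_squared(haplotypes):
    """Pairwise r^2 between two biallelic SNPs.

    `haplotypes` is a list of (a, b) tuples, one per phased haplotype,
    with alleles coded 0 or 1 at SNP A and SNP B (hypothetical coding).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom
```

When phase is unknown, haplotype frequencies would first have to be inferred, which is where the phasing methods discussed above come in.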
Following our recent report of high levels of recombination and geographic structuring amongst isolates from two populations, we have investigated global patterns of herpes simplex virus type 1 (HSV-1) molecular diversity using population samples from six countries in Europe, Asia and Africa. Sequence comparisons show that HSV-1 from Kenya is both highly diverse and distinct from either European or Asian HSV-1. HSV-1 populations are much more highly differentiated than human populations at the same geographic scales, with 35% of total variation at the level of inter-population comparisons, a difference likely to be due to higher rates of both mutation and genetic drift in HSV-1 than in equivalent human data. There is substantial differentiation between northwestern European HSV-1 populations and those from East Asia, and while patterns of British and Swedish HSV-1 variation were indistinguishable, differentiation was detectable amongst Chinese, Korean and Japanese HSV-1 samples, in spite of their lower overall diversity. The program Structure was used to reconstruct ancestral Eurasian lineages, which we estimated to have originated approximately 60,000 years ago. A specific pattern detected amongst East Asian HSV-1 isolates is currently best explained by the two waves of migration responsible for the peopling of Japan.
International Congress of Mathematicians, ICM 2006, 3 pp. 559-574. | Citations: 1 (Scopus) | 2006. Modelling genes: Mathematical and statistical challenges in genomics
The completion of the human and other genome projects, and the ongoing development of high-throughput experimental methods for measuring genetic variation, have dramatically changed the scale of information available and the nature of the questions which can now be asked in modern biomedical genetics. Although there is a long history of mathematical modelling in genetics, these developments offer exciting new opportunities and challenges for the mathematical sciences. We focus here on the challenges within human population genetics, in which data document molecular genetic variation between different people. The explosion of data on human variation allows us to study aspects of the underlying evolutionary processes and the molecular mechanisms behind them; the patterns of genetic variation in different geographical regions and the ancestral histories of human populations; and the genetic basis of common human diseases. In each case, sophisticated mathematical, statistical, and computational tools are needed to unravel much of the information in the data, with many of the best methods combining complex stochastic modelling and modern computationally-intensive statistical methods. But the rewards are great: key pieces of scientific knowledge simply would not have been available by other means. © 2006 European Mathematical Society.
Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.
Genetic maps, which document the way in which recombination rates vary over a genome, are an essential tool for many genetic analyses. We present a high-resolution genetic map of the human genome, based on statistical analyses of genetic variation data, and identify more than 25,000 recombination hotspots, together with motifs and sequence contexts that play a role in hotspot activity. Differences between the behavior of recombination rates over large (megabase) and small (kilobase) scales lead us to suggest a two-stage model for recombination in which hotspots are stochastic features, within a framework in which large-scale rates are constrained.
The fine-scale distribution of meiotic recombination events in the human genome can be inferred from patterns of haplotype diversity in human populations but directly studied only by high-resolution sperm typing. Both approaches indicate that crossovers are heavily clustered into narrow recombination hot spots. But our direct understanding of hot-spot properties and distributions is largely limited to sperm typing in the major histocompatibility complex (MHC). We now describe the analysis of an unremarkable 206-kb region on human chromosome 1, which identified localized regions of linkage disequilibrium breakdown that mark the locations of sperm crossover hot spots. The distribution, intensity and morphology of these hot spots are markedly similar to those in the MHC. But we also accidentally detected additional hot spots in regions of strong association. Coalescent analysis of genotype data detected most of the hot spots but showed significant differences between sperm crossover frequencies and historical recombination rates. This raises the possibility that some hot spots, particularly those in regions of strong association, may have evolved very recently and not left their full imprint on haplotype diversity. These results suggest that hot spots could be very abundant and possibly fluid features of the human genome.
After nearly 10 years of intense academic and commercial research effort, large genome-wide association studies for common complex diseases are now imminent. Although these conditions involve a complex relationship between genotype and phenotype, including interactions between unlinked loci, the prevailing strategies for analysis of such studies focus on the locus-by-locus paradigm. Here we consider analytical methods that explicitly look for statistical interactions between loci. We show first that they are computationally feasible, even for studies of hundreds of thousands of loci, and second that even with a conservative correction for multiple testing, they can be more powerful than traditional analyses under a range of models for interlocus interactions. We also show that plausible variations across populations in allele frequencies among interacting loci can markedly affect the power to detect their marginal effects, which may account in part for the well-known difficulties in replicating association results. These results suggest that searching for interactions among genetic loci can be fruitfully incorporated into analysis strategies for genome-wide association studies.
We compared fine-scale recombination rates at orthologous loci in humans and chimpanzees by analyzing polymorphism data in both species. Strong statistical evidence for hotspots of recombination was obtained in both species. Despite approximately 99% identity at the level of DNA sequence, however, recombination hotspots were found rarely (if at all) at the same positions in the two species, and no correlation was observed in estimates of fine-scale recombination rates. Thus, local patterns of recombination rate have evolved rapidly, in a manner disproportionate to the change in DNA sequence.
Nature Genetics, 36 (11), p. 1131. | 2004. Genomic control to the extreme - Reply
There has been considerable recent interest in understanding the way in which recombination rates vary over small physical distances, and the extent of recombination hotspots, in various genomes. Here we adapt, apply, and assess the power of recently developed coalescent-based approaches to estimating recombination rates from sequence polymorphism data. We apply full-likelihood estimation to study rate variation in and around a well-characterized recombination hotspot in humans, in the beta-globin gene cluster, and show that it provides similar estimates, consistent with those from sperm studies, from two populations deliberately chosen to have different demographic and selectional histories. We also demonstrate how approximate-likelihood methods can be used to detect local recombination hotspots from genomic-scale SNP data. In a simulation study based on 80 100-kb regions, these methods detect 43 out of 60 hotspots (ranging from 1 to 2 kb in size), with only two false positives out of 2000 subregions that were tested for the presence of a hotspot. Our study suggests that new computational tools for sophisticated analysis of population diversity data are valuable for hotspot detection and fine-scale mapping of local recombination rates.
Herpes simplex virus type 1 (HSV-1) is highly prevalent in all human populations and has been presumed to evolve in a clonal manner because of a lack of evidence for significant levels of co-infection. Different HSV-1 populations have distinct distributions of strains and the long timescale evident from HSV-1 population diversity has led to the suggestion that studies of virus variability may yield information about host population history. In this sequencing study of three segments of the HSV-1 genome in population samples from the UK and Korea, evidence of recombination was widespread both at the level of reassortment between widely separated loci and within shorter contiguous sequences and the estimated rate of recombination was comparable to that of mutation. Since recombination requires the coexistence of two viral genomes, these results suggest that co-infection by genetically distinct strains may be a more important aspect of HSV-1 epidemiology than previously realized. With its capacity to make new combinations of variants available for selection, substantial recombination requires a radically revised model for the rate and mode of evolution of the virus.
Large-scale association studies hold substantial promise for unraveling the genetic basis of common human diseases. A well-known problem with such studies is the presence of undetected population structure, which can lead to both false positive results and failures to detect genuine associations. Here we examine approximately 15,000 genome-wide single-nucleotide polymorphisms typed in three population groups to assess the consequences of population structure on the coming generation of association studies. The consequences of population structure on association outcomes increase markedly with sample size. For the size of study needed to detect typical genetic effects in common diseases, even the modest levels of population structure within population groups cannot safely be ignored. We also examine one method for correcting for population structure (Genomic Control). Although it often performs well, it may not correct for structure if too few loci are used and may overcorrect in other settings, leading to substantial loss of power. The results of our analysis can guide the design of large-scale association studies.
The nature and scale of recombination rate variation are largely unknown for most species. In humans, pedigree analysis has documented variation at the chromosomal level, and sperm studies have identified specific hotspots in which crossing-over events cluster. To address whether this picture is representative of the genome as a whole, we have developed and validated a method for estimating recombination rates from patterns of genetic variation. From extensive single-nucleotide polymorphism surveys in European and African populations, we find evidence for extreme local rate variation spanning four orders in magnitude, in which 50% of all recombination events take place in less than 10% of the sequence. We demonstrate that recombination hotspots are a ubiquitous feature of the human genome, occurring on average every 200 kilobases or less, but recombination occurs preferentially outside genes.
A new algorithm is presented for exact simulation from the conditional distribution of the genealogical history of a sample, given the composition of the sample, for population genetics models with general diploid selection. The method applies to the usual diffusion approximation of evolution at a single locus, in a randomly mating population of constant size, for mutation models in which the distribution of the type of a mutant does not depend on the type of the progenitor allele; this includes any model with only two alleles. The new method is applied to ancestral inference for the two-allele case, both with genic selection and heterozygote advantage and disadvantage, where one of the alleles is assumed to have resulted from a unique mutation event. The paper describes how the method could be used for inference when data are also available at neutral markers linked to the locus under selection. It also informally describes and constructs the non-neutral Fleming-Viot measure-valued diffusion.
In this report, we compare and contrast three previously published Bayesian methods for inferring haplotypes from genotype data in a population sample. We review the methods, emphasizing the differences between them in terms of both the models ("priors") they use and the computational strategies they employ. We introduce a new algorithm that combines the modeling strategy of one method with the computational strategies of another. In comparisons using real and simulated data, this new algorithm outperforms all three existing methods. The new algorithm is included in the software package PHASE, version 2.0, available online (http://www.stat.washington.edu/stephens/software.html).
There has been some controversy in the literature concerning whether Icelanders are genetically homogenous or heterogeneous relative to other European populations. We reassess this question in the light of large data sets spanning 83 autosomal SNP loci, 14 serogenetic loci, 6622 Y-chromosomes and 3214 sequences from mtDNA hypervariable segments 1 and 2 (HVS1 and HVS2). Our results strongly support the hypothesis that genetic drift, with a consequent loss of variation, has had a greater impact on Icelanders than most other Europeans. We also analyse 7245 HVS1 sequences from 25 European populations. In line with other studies, we observe a deficit of rare HVS1 haplotypes and an excess of intermediate frequency haplotypes in Icelanders compared to most European populations, with some measures of genetic diversity indicating relative heterogeneity and others indicating relative homogeneity of Icelanders. Simulations indicate that genetic drift, and not admixture (as proposed by Arnason, 2003) is the most likely cause of the atypical Icelandic HVS1 frequency spectrum. These simulations reveal that gene diversity (heterozygosity) and mean pairwise differences are largely insensitive to events in recent population history, while statistics based on the number of haplotypes or segregating sites are much more sensitive. Overall, our analyses strongly indicate that the Icelandic gene pool is less heterogeneous than those of most other European populations.
The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention. © 2003 Nature Publishing Group.
Journal of the Royal Statistical Society Series B: Statistical Methodology, 64 (4), pp. 737-775. | Citations: 6 (Web of Science Lite) | 2002. Discussion on the meeting on 'Statistical modelling and analysis of genetic data'
Genetics, 159 (3), pp. 1299-1318. | Citations: 204 (Scopus) | 2001. Estimating recombination rates from population genetic data.
We introduce a new method for estimating recombination rates from population genetic data. The method uses a computationally intensive statistical procedure (importance sampling) to calculate the likelihood under a coalescent-based model. Detailed comparisons of the new algorithm with two existing methods (the importance sampling method of Griffiths and Marjoram and the MCMC method of Kuhner and colleagues) show it to be substantially more efficient. (The improvement over the existing importance sampling scheme is typically by four orders of magnitude.) The existing approaches not infrequently led to misleading results on the problems we investigated. We also performed a simulation study to look at the properties of the maximum-likelihood estimator of the recombination rate and its robustness to misspecification of the demographic model.
Case-control tests for association are an important tool for mapping complex-trait genes. But population structure can invalidate this approach, leading to apparent associations at markers that are unlinked to disease loci. Family-based tests of association can avoid this problem, but such studies are often more expensive and in some cases--particularly for late-onset diseases--are impractical. In this review article we describe a series of approaches published over the past 2 years which use multilocus genotype data to enable valid case-control tests of association, even in the presence of population structure. These tests can be classified into two categories. "Genomic control" methods use the independent marker loci to adjust the distribution of a standard test statistic, while "structured association" methods infer the details of population structure en route to testing for association. We discuss the statistical issues involved in the different approaches and present results from simulations comparing the relative performance of the methods under a range of models.
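The "genomic control" idea described in this abstract can be sketched briefly. The example below assumes the commonly used median-based variant: an inflation factor λ is estimated as the median of 1-d.f. chi-square statistics at presumed-null markers divided by the theoretical median (about 0.4549), and the test statistic at a candidate marker is divided by λ. The function names and the choice to floor λ at 1 are assumptions for illustration, not details taken from the review.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(null_stats):
    """Estimate lambda from 1-d.f. chi-square statistics at unlinked markers.

    Flooring at 1.0 (an assumption here) avoids inflating statistics when
    the null markers show no evidence of population structure.
    """
    return max(1.0, median(null_stats) / CHI2_1DF_MEDIAN)

def adjust_statistic(stat, lam):
    """Rescale a candidate-marker chi-square statistic by lambda."""
    return stat / lam
```

For example, if the null markers give λ = 2, a raw statistic of 8.0 at a candidate marker would be compared with the chi-square reference distribution as 4.0.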
Genetics, 159 (2), pp. 853-867. | Citations: 24 (Web of Science Lite) | 2001. Likelihoods and simulation methods for a class of nonneutral population genetics models.
Methods for simulating samples and sample statistics, under mutation-selection-drift equilibrium for a class of nonneutral population genetics models, and for evaluating the likelihood surface, in selection and mutation parameters, are developed and applied for observed data. The methods apply to large populations in settings in which selection is weak, in the sense that selection intensities, like mutation rates, are of the order of the inverse of the population size. General diploid selection is allowed, but the approach is currently restricted to models, such as the infinite alleles model and certain K-models, in which the type of a mutant allele does not depend on the type of its progenitor allele. The simulation methods have considerable advantages over available alternatives. No other methods currently seem practicable for approximating likelihood surfaces.
Current routine genotyping methods typically do not provide haplotype information, which is essential for many analyses of fine-scale molecular-genetics data. Haplotypes can be obtained, at considerable cost, experimentally or (partially) through genotyping of additional family members. Alternatively, a statistical method can be used to infer phase and to reconstruct haplotypes. We present a new statistical method, applicable to genotype data at linked loci from a population sample, that improves substantially on current algorithms; often, error rates are reduced by > 50%, relative to its nearest competitor. Furthermore, our algorithm performs well in absolute terms, suggesting that reconstructing haplotypes experimentally or by genotyping additional family members may be an inefficient use of resources.
Consider a population of fixed size consisting of $N$ haploid individuals. Assume that this population evolves according to the two-allele neutral Moran model in mathematical genetics. Denote the two alleles by $A_1$ and $A_2$. Allow mutation from one type to another and let $0 < \gamma < 1$ be the sum of mutation probabilities. All the information about the population is recorded by the Markov chain $X = (X(t))_{t > 0}$ which counts the number of individuals of type $A_1$. In this paper we study the time taken for the population to 'reach' stationarity (in the sense of separation and total variation distances) when initially all individuals are of one type. We show that after $t^* = N\gamma^{-1}\log N + cN$ the separation distance between the law of $X(t^*)$ and its stationary distribution converges to $1 - \exp(-\gamma e^{-\gamma c})$ as $N \to \infty$. For the total variation distance an asymptotic upper bound is obtained. The results depend on a particular duality, and couplings, between $X$ and a genealogical process known as the lines of descent process.
The use, in association studies, of the forthcoming dense genomewide collection of single-nucleotide polymorphisms (SNPs) has been heralded as a potential breakthrough in the study of the genetic basis of common complex disorders. A serious problem with association mapping is that population structure can lead to spurious associations between a candidate marker and a phenotype. One common solution has been to abandon case-control studies in favor of family-based tests of association, such as the transmission/disequilibrium test (TDT), but this comes at a considerable cost in the need to collect DNA from close relatives of affected individuals. In this article we describe a novel, statistically valid, method for case-control association studies in structured populations. Our method uses a set of unlinked genetic markers to infer details of population structure, and to estimate the ancestry of sampled individuals, before using this information to test for associations within subpopulations. It provides power comparable with the TDT in many settings and may substantially outperform it if there are conflicting associations in different subpopulations.
ANNALS OF PROBABILITY, 28 (3), pp. 1063-1110. | Citations: 12 (Scopus) | 2000. Continuum-sites stepping-stone models, coalescing exchangeable partitions and random trees
Analogues of stepping-stone models are considered where the site-space is continuous, the migration process is a general Markov process, and the type-space is infinite. Such processes were defined in previous work of the second author by specifying a Feller transition semigroup in terms of expectations of suitable functionals for systems of coalescing Markov processes. An alternative representation is obtained here in terms of a limit of interacting particle systems. It is shown that, under a mild condition on the migration process, the continuum-sites stepping-stone process has continuous sample paths. The case when the migration process is Brownian motion on the circle is examined in detail using a duality relation between coalescing and annihilating Brownian motion. This duality relation is also used to show that a random compact metric space that is naturally associated to an infinite family of coalescing Brownian motions on the circle has Hausdorff and packing dimension both almost surely equal to 1/2 and, moreover, this space is capacity equivalent to the middle-1/2 Cantor set (and hence also to the Brownian zero set).
Genetics, 155 (2), pp. 945-959. | Citations: 15244 (Scopus) | 2000. Inference of population structure using multilocus genotype data.
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci (e.g., seven microsatellite loci in an example using genotype data from an endangered bird species). The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
The mutation rate of the mitochondrial control region has been widely used to calibrate human population history. However, estimates of the mutation rate in this region have spanned two orders of magnitude. To readdress this rate, we sequenced the mtDNA control region in 272 individuals, who were related by a total of 705 mtDNA transmission events, from 26 large Icelandic pedigrees. Three base substitutions were observed, and the mutation rate across the two hypervariable regions was estimated to be 3/705 = .0043 per generation (95% confidence interval [CI] .00088-.013), or .32/site/1 million years (95% CI .065-.97). This study is substantially larger than others published, which have directly assessed mtDNA mutation rates on the basis of pedigrees, and the estimated mutation rate is intermediate among those derived from pedigree-based studies. Our estimated rate remains higher than those based on phylogenetic comparisons. We discuss possible reasons for, and consequences of, this discrepancy. The present study also provides information on rates of insertion/deletion mutations, rates of heteroplasmy, and the reliability of maternal links in the Icelandic genealogy database.
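The per-generation estimate quoted in this abstract is simple arithmetic (3 substitutions over 705 transmissions). The sketch below recomputes it and attaches an exact Poisson interval for the count. The interval method is an assumption made for illustration: the abstract does not say how its confidence interval was obtained, so the upper bound computed here need not agree with the reported value to the last digit. All function names are hypothetical.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))

def solve_decreasing(f, target, lo, hi, iters=200):
    """Bisection for f(x) = target, where f is decreasing in x on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid   # f still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_interval(count, alpha=0.05):
    """Two-sided exact interval for a Poisson mean given an observed count.

    Assumed method (not taken from the abstract): invert the Poisson tail
    probabilities at alpha/2 on each side.
    """
    if count == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= count | lam) = alpha/2
        lower = solve_decreasing(lambda lam: poisson_cdf(count - 1, lam),
                                 1 - alpha / 2, 0.0, 100.0)
    # Upper limit: P(X <= count | lam) = alpha/2
    upper = solve_decreasing(lambda lam: poisson_cdf(count, lam),
                             alpha / 2, 0.0, 100.0)
    return lower, upper

substitutions = 3      # from the abstract
transmissions = 705    # from the abstract

rate = substitutions / transmissions
count_lo, count_hi = exact_poisson_interval(substitutions)
rate_lo, rate_hi = count_lo / transmissions, count_hi / transmissions

print(f"rate per generation: {rate:.4f}")
print(f"assumed exact Poisson 95% interval: {rate_lo:.5f} to {rate_hi:.4f}")
```

With so few observed events the interval is wide and strongly asymmetric around the point estimate, which is why the abstract reports it alongside the rate.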
Full likelihood-based inference for modern population genetics data presents methodological and computational challenges. The problem is of considerable practical importance and has attracted recent attention, with the development of algorithms based on importance sampling (IS) and Markov chain Monte Carlo (MCMC) sampling. Here we introduce a new IS algorithm. The optimal proposal distribution for these problems can be characterized, and we exploit a detailed analysis of genealogical processes to develop a practicable approximation to it. We compare the new method with existing algorithms on a variety of genetic examples. Our approach substantially outperforms existing IS algorithms, with efficiency typically improved by several orders of magnitude. The new method also compares favourably with existing MCMC methods in some problems, and less favourably in others, suggesting that both IS and MCMC methods have a continuing role to play in this area. We offer insights into the relative advantages of each approach, and we discuss diagnostics in the IS framework.
Genetics, 154 (4), pp. 1793-1807. | Citations: 32 (Scopus) | 2000. Microsatellite mutations and inferences about human demography.
Microsatellites have been widely used as tools for population studies. However, inference about population processes relies on the specification of mutation parameters that are largely unknown and likely to differ across loci. Here, we use data on somatic mutations to investigate the mutation process at 14 tetranucleotide repeats and carry out an advanced multilocus analysis of different demographic scenarios on worldwide population samples. We use a method based on less restrictive assumptions about the mutation process, which is more powerful to detect departures from the null hypothesis of constant population size than other methods previously applied to similar data sets. We detect a signal of population expansion in all samples examined, except for one African sample. As part of this analysis, we identify an "anomalous" locus whose extreme pattern of variation cannot be explained by variability in mutation size. Exaggerated mutation rate is proposed as a possible cause for its unusual variation pattern. We evaluate the effect of using it to infer population histories and show that inferences about demographic histories are markedly affected by its inclusion. In fact, exclusion of the anomalous locus reduces interlocus variability of statistics summarizing population variation and strengthens the evidence in favor of demographic growth.
ADVANCES IN APPLIED PROBABILITY, 31 (4), pp. 1027-1028. | Citations: 7 (Web of Science Lite) | 1999. Discussion: Recent common ancestors of all present-day individuals
ANNALS OF APPLIED PROBABILITY, 9 (4), pp. 1091-1148. | Citations: 71 (Scopus) | 1999. Genealogical processes for Fleming-Viot models with selection and recombination
Infinite population genetic models with general type space incorporating mutation, selection and recombination are considered. The Fleming-Viot measure-valued diffusion is represented in terms of a countably infinite-dimensional process. The complete genealogy of the population at each time can be recovered from the model. Results are given concerning the existence of stationary distributions and ergodicity and absolute continuity of the stationary distribution for a model with selection with respect to the stationary distribution for the corresponding neutral model.
This paper is concerned with the structure of the genealogy of a sample in which it is observed that some subset of chromosomes carries a particular mutation, assumed to have arisen uniquely in the history of the population. A rigorous theoretical study of this conditional genealogy is given using coalescent methods. Particular results include the mean, variance, and density of the age of the mutation conditional on its frequency in the sample. Most of the development relates to populations of constant size, but we discuss the extension to populations which have grown exponentially to their present size.
NATURE, 397 (6714), pp. 32-32. | Citations: 7 (Web of Science Lite) | 1999. The Thomas Jefferson paternity case - Reply
ANNALS OF PROBABILITY, 27 (1), pp. 166-205. | Citations: 114 (Scopus) | 1999. Particle representations for measure-valued population models
Models of populations in which a type or location, represented by a point in a metric space $E$, is associated with each individual in the population are considered. A population process is neutral if the chances of an individual replicating or dying do not depend on its type. Measure-valued processes are obtained as infinite population limits for a large class of neutral population models, and it is shown that these measure-valued processes can be represented in terms of the total mass of the population and the de Finetti measures associated with an $E^\infty$-valued particle model $X = (X_1, X_2, \ldots)$ such that, for each $t \ge 0$, $(X_1(t), X_2(t), \ldots)$ is exchangeable. The construction gives an explicit connection between genealogical and diffusion models in population genetics. The class of measure-valued models covered includes both neutral Fleming-Viot and Dawson-Watanabe processes. The particle model gives a simple representation of the Dawson-Perkins historical process and Perkins's historical stochastic integral can be obtained in terms of classical semimartingale integration. A number of applications to new and known results on conditioning, uniqueness and limiting behavior are described.
Nature, 396 (6706), pp. 27-28. | Citations: 171 (Scopus) | 1998. Jefferson fathered slave's last child.
Genetics, 148 (3), pp. 1269-1284. | Citations: 130 (Scopus) | 1998. Heterogeneity of microsatellite mutations within and between loci, and implications for human demographic histories.
Microsatellites have been widely used to reconstruct human evolution. However, the efficient use of these markers relies on information regarding the process producing the observed variation. Here, we present a novel approach to the locus-by-locus characterization of this process. By analyzing somatic mutations in cancer patients, we estimated the distributions of mutation size for each of 20 loci. The same loci were then typed in three ethnically diverse population samples. The generalized stepwise mutation model was used to test the predicted relationship between population and mutation parameters under two demographic scenarios: constant population size and rapid expansion. The agreement between the observed and expected relationship between population and mutation parameters, even when the latter are estimated in cancer patients, confirms that somatic mutations may be useful for investigating the process underlying population variation. Estimated distributions of mutation size differ substantially amongst loci, and mutations of more than one repeat unit are common. A new statistic, the normalized population variance, is introduced for multilocus estimation of demographic parameters, and for testing demographic scenarios. The observed population variation is not consistent with a constant population size. Time estimates of the putative population expansion are in agreement with those obtained by other methods.
Genetics, 146 (3), pp. 1185-1195. | Citations: 97 (Scopus) | 1997. The coalescent process with selfing.
A method of estimating the selfing rate using DNA sequence data was recently proposed by Milligan. Unfortunately, a number of errors make interpretation of his results problematic. In the present paper we first show how the usual coalescent process can be adapted to models that include selfing, and then use this result to find moment estimators as well as the likelihood surface for the selfing rate, s, and the scaled mutation rate, theta. We conclude that, regardless of the method used, large sample sizes are necessary to estimate s with any degree of certainty, and that the estimate is always highly sensitive to recent changes in the true value.
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY, 160 pp. 460-461. | 1997. Bayesian analysis of DNA profiling data in forensic identification applications - Discussion
Mich Law Rev, 97 (4), pp. 931-984. | Citations: 31 (Scopus) | 1997. DNA database searches and the legal consumption of scientific evidence.
Genetics, 145 (2), pp. 505-518. | Citations: 363 (Scopus) | 1997. Inferring coalescence times from DNA sequence data.
The paper is concerned with methods for the estimation of the coalescence time (time since the most recent common ancestor) of a sample of intraspecies DNA sequences. The methods take advantage of prior knowledge of population demography, in addition to the molecular data. While some theoretical results are presented, a central focus is on computational methods. These methods are easy to implement, and, since explicit formulae tend to be either unavailable or unilluminating, they are also more useful and more informative in most applications. Extensions are presented that allow for the effects of uncertainty in our knowledge of population size and mutation rates, for variability in population sizes, for regions of different mutation rate, and for inference concerning the coalescence time of the entire population. The methods are illustrated using recent data from the human Y chromosome.
A Polya-like urn arises in studying stationary distributions and stationary sampling distributions in neutral (Fleming-Viot) genetics models with bounded mutation rates. This paper gives a detailed analysis of asymptotic properties of the urn. In particular, it is shown that in a sample of size n, the maximum number of mutations along any lineage from the common ancestor grows extremely slowly with n. Kesten's result on the growth rate of the number of types when the mutation process is simple symmetric random walk (the Ohta-Kimura model) follows similarly.
Genetics, 144 (3), pp. 1247-1262. | Citations: 104 (Web of Science Lite) | 1996. Optimal sequencing strategies for surveying molecular genetic diversity.
Two commonly used measures of genetic diversity for intraspecies DNA sequence data are based, respectively, on the number of segregating sites, and on the average number of pairwise nucleotide differences. Expressions are derived for their variance in the presence of intragenic recombination for a panmictic population of fixed size that is at neutral equilibrium at the region sequenced. We show that, in contrast to the slow decrease in variance with increasing sample size, if the recombination rate is nonzero, the asymptotic rate of decrease of variance with increasing sequence length, for fixed sample size, is quite rapid. In particular, it is close to that which would be obtained by sequencing independent chromosome regions. The correlation between measures of diversity from linked regions is also examined. For a given total number of bases sequenced in a particular region, optimal sequencing strategies are derived. These typically involve sequencing relatively few (three to 10) long copies of the region. Under optimal strategies, the variances of the two measures are very similar for most parameter values considered. Results concerning optimal sequencing strategies will be sensitive to gross departures from the underlying assumptions, such as population bottlenecks, selective sweeps, and substantial population substructure.
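The two diversity measures named in this abstract can be computed directly from a set of aligned sequences. The following is a minimal sketch, assuming equal-length sequences with no gaps or missing data; the function names and the toy sample are illustrative and not taken from the paper. Watterson's estimator is included only as the standard way of scaling the number of segregating sites so that it is comparable with the pairwise measure.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Number of alignment columns at which the sample is polymorphic."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def mean_pairwise_differences(seqs):
    """Average number of nucleotide differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)

def watterson_theta(seqs):
    """Segregating sites scaled by a_n = 1 + 1/2 + ... + 1/(n - 1)."""
    n = len(seqs)
    a_n = sum(1 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n

# Toy aligned sample (hypothetical data, three sequences of length four).
sample = ["AAAA", "AAAT", "AATT"]

print(segregating_sites(sample))
print(mean_pairwise_differences(sample))
print(watterson_theta(sample))
```

Under neutral equilibrium both scaled measures estimate the same quantity, which is why their variances can be compared when choosing between many short copies and few long copies of a region.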
J Forensic Sci, 41 (4), pp. 603-607. | Citations: 40 (Scopus) | 1996. Evaluating DNA profile evidence when the suspect is identified through a database search.
The paper is concerned with the strength of DNA evidence when a suspect is identified via a search through a database of the DNA profiles of known individuals. Consideration of the appropriate likelihood ratio shows that in this setting the DNA evidence is (slightly) stronger than when a suspect is identified by other means, subsequently profiled, and found to match. The recommendation of the 1992 report of the US National Research Council that DNA evidence that is used to identify the suspect should not be presented at trial thus seems unnecessarily conservative. The widely held view that DNA evidence is weaker when it results from a database search seems to be based on a rationale that leads to absurd conclusions in some examples. Moreover, this view is inconsistent with the principle, which enjoys substantial support, that evidential weight should be measured by likelihood ratios. The strength of DNA evidence is shown also to be slightly increased for other forms of search procedure. While the DNA evidence is stronger after a database search, the overall case against the suspect may not be, and the problems of incorporating the DNA with the non-DNA evidence can be particularly important in such cases.
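The claim that a database search makes the DNA evidence slightly stronger can be illustrated with a toy calculation. The sketch below is not the paper's own analysis: it assumes a closed population of possible culprits with a uniform prior, a common match probability for every innocent person, independent matches, and no laboratory error. The function names and the numbers are hypothetical.

```python
def posterior_single_suspect(population_size, match_prob):
    """Posterior that the matching suspect is the culprit when only the
    suspect has been profiled.

    Every other member of the population remains an untested alternative,
    each explaining the match with probability match_prob.
    """
    return 1.0 / (1.0 + (population_size - 1) * match_prob)

def posterior_database_search(population_size, database_size, match_prob):
    """Posterior when the suspect is the only match in a database search.

    Database members who did not match are excluded as culprits, so only
    the people outside the database remain as alternatives.
    """
    return 1.0 / (1.0 + (population_size - database_size) * match_prob)

# Hypothetical numbers chosen only to make the difference visible.
N = 1_000_000   # possible culprits
n = 10_000      # profiles in the database, including the suspect
p = 1e-6        # match probability for an innocent person

single = posterior_single_suspect(N, p)
searched = posterior_database_search(N, n, p)
print(f"single suspect:  {single:.4f}")
print(f"database search: {searched:.4f}")
```

In this model the search can only remove alternatives, so the posterior after a search is never lower than for a single profiled suspect. The abstract's caveat still applies: the non-DNA evidence against a suspect found by a search may be much weaker, which this sketch hides by using the same prior in both cases.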
Science, 272 (5266), pp. 1357-1359. | Citations: 32 (Web of Science Lite) | 1996. Estimating the age of the common ancestor of men from the ZFY intron.
ANNALS OF PROBABILITY, 24 (2), pp. 698-742. | Citations: 71 (Scopus) | 1996. A countable representation of the Fleming-Viot measure-valued diffusion
The Fleming-Viot measure-valued diffusion arises as the infinite population limit of various discrete genetic models with general type space. The paper gives a countable construction of the process as the empirical measure carried by a certain interactive particle system. This explicit representation facilitates the study of various properties of the Fleming-Viot process. The construction also carries versions of the familiar genealogical processes from population genetics, in particular, Kingman's coalescent, thus unifying the genealogical and measure-valued approaches to the subject.
The exclusion process is an interacting particle system in which particles perform random walks on a lattice except that they may not move to a position already occupied. In this paper we show how techniques derived from quantum mechanics may be used to achieve asymptotic results for an exclusion process on a complete graph. In particular, an abrupt approach to stationarity is demonstrated. The similarity between the transition matrix for the exclusion process and the Hamiltonian for the Heisenberg ferromagnet is not well understood. However, it allows not only quantum operator techniques to be carried over to a problem in stochastic processes, but also concepts such as the "mean field".
Ciba Found Symp, 197 (197), pp. 25-40. | Citations: 18 (Scopus) | 1996. Interpreting genetic variability: the effects of shared evolutionary history.
Data from different individuals at a single locus are positively correlated because of the shared genealogy of the sampled genes. This paper illustrates the qualitative effects on genealogical trees of assumptions about population demography, and it considers the consequences for genetic variability. An understanding of these effects is invaluable in the interpretation of data and for inferences about population history. In contrast, traditional genetic measures of diversity and approximation methods do not seem well suited for addressing the problem.
The controversy over the interpretation of DNA profile evidence in forensic identification can be attributed in part to confusion over the mode(s) of statistical inference appropriate to this setting. Although there has been substantial discussion in the literature of, for example, the role of population genetics issues, few authors have made explicit the inferential framework which underpins their arguments. This lack of clarity has led both to unnecessary debates over ill-posed or inappropriate questions and to the neglect of some issues which can have important consequences. We argue that the mode of statistical inference which seems to underlie the arguments of some authors, based on a hypothesis testing framework, is not appropriate for forensic identification. We propose instead a logically coherent framework in which, for example, the roles both of the population genetics issues and of the nonscientific evidence in a case are incorporated. Our analysis highlights several widely held misconceptions in the DNA profiling debate. For example, the profile frequency is not directly relevant to forensic inference. Further, very small match probabilities may in some settings be consistent with acquittal. Although DNA evidence is typically very strong, our analysis of the coherent approach highlights situations which can arise in practice where alternative methods for assessing DNA evidence may be misleading.
In comparing a particular DNA profile with that from an unknown (but distinct) individual, matches at different loci between the profiles will not be independent, even in a randomly mating population, because of the presence in the population of relatives of the individuals. The paper contains a theoretical analysis of the extent of this effect on the match probability, for profiling techniques which separately probe different loci. Naive calculation using the product rule could substantially understate the match probability. Past a certain point, the testing of additional loci provides no more information than would be available in discriminating between sibs. The correlation effect described here would be unimportant in criminal casework if close relatives of the suspect, and in particular full-sibs, were excluded as possible culprits. In the absence of such exclusions the current practice of effectively ignoring such relatives in presenting match probabilities could be extremely prejudicial to a suspect, even in cases in which there is no direct evidence to incriminate his/her relatives.
The paper considers aspects of the match probability calculation for multi-locus DNA profiles and a related calculation which aims to assess the probability that a pair of profiles is concordant for the presence and absence of bands. It is suggested that levels of allelism and linkage for multi-locus profiles may be higher than reported in previous studies, and that comparison of bandsharing values between different studies is problematic. Our view is that the independence assumptions which underpin the calculations have not been established. The effect of ignoring (local) heterogeneities in band frequencies may be non-conservative. Concerns thus raised about the match probability calculation could be important in practical casework. The speculative nature of some aspects of the concordance probability calculation would seem to render it inappropriate for use in court.
Genealogical or coalescent methods have proved very useful in interpreting and understanding a wide range of population genetic data. Our aim is to illustrate some of the central ideas behind this approach. The primary focus is genealogy in neutral genetic models, for which the effects of demography can be separated from those of mutation. We describe the coalescent for panmictic populations of fixed size, and its extensions to incorporate various assumptions about variation in population size and nonrandom mating caused by geographical population subdivision. The effects of such genealogical structure on patterns and correlations in genetic data are discussed. An urn model is useful for simulating samples at loci with complex mutation mechanisms. We give two applications of the genealogical approach. The first concerns methods for estimating the mutation rate from infinitely-many-sites data, and the second relates to inference about recent common ancestors and population history.
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY, 158 (1), pp. 21-53. | Citations: 64 (Web of Science Lite) | 1995. INFERENCE IN FORENSIC IDENTIFICATION
This paper is concerned with the approximation of early stages of epidemic processes by branching processes. A general model for an epidemic in a closed, homogeneously mixing population is presented. A construction of a sequence of such epidemics, indexed by the initial number of susceptibles $N$, from the limiting branching process is described. Strong convergence of the epidemic processes to the branching process is shown when the latter goes extinct. When the branching process does not go extinct, necessary and sufficient conditions on the sequence $(t_N)$ for strong convergence over the time interval $[0, t_N]$ are provided. Convergence of a wide variety of functionals of the epidemic process to corresponding functionals of the branching process is shown, and bounds are provided on the total variation distance for given $N$. The theory is illustrated by reference to the general stochastic epidemic. Generalisations to, for example, multipopulation epidemics are described briefly.
CRIMINAL LAW REVIEW, pp. 711-721. | Citations: 26 (Web of Science Lite) | 1994. THE PROSECUTORS FALLACY AND DNA EVIDENCE
ADVANCES IN APPLIED PROBABILITY, 26 (3), pp. 715-727. | Citations: 9 (Web of Science Lite) | 1994. APPROACH TO STATIONARITY OF THE BERNOULLI-LAPLACE DIFFUSION-MODEL
Although some of the initial controversy surrounding DNA profiling has been resolved, courts have been misled about the strength of DNA evidence.
Genetics, 136 (2), pp. 673-683. | Citations: 167 (Scopus) | 1994. Pairwise comparisons of mitochondrial DNA sequences in subdivided populations and implications for early human evolution.
We consider the effect on the distribution of pairwise differences between mitochondrial DNA sequences of the incorporation into the underlying population genetics model of two particular effects that seem realistic for human populations. The first is that the population size was roughly constant before growing to its current level. The second is that the population is geographically subdivided rather than panmictic. In each case these features tend to encourage multimodal distributions of pairwise differences, in contrast to existing, unimodal datasets. We argue that population genetics models currently used to analyze such data may thus fail to reflect important features of human mitochondrial DNA evolution. These may include selection on the mitochondrial genome, more realistic mutation mechanisms, or special population or migration dynamics. Particularly in view of the variability inherent in the single available human mitochondrial genealogy, it is argued that until these effects are better understood, inferences from such data should be rather cautious.
For a general Markov SIS epidemic model, the fates of individuals at different times are shown to be positively correlated. When the population is subjected to two diseases, a certain condition, here called positive interference, results in positive correlations between individuals with respect to either disease, while another condition, called competition, gives negative correlation between diseases and positive correlation within each disease. The results generalize to two classes of disease, with positive interference within each class and competition between classes. A general (non-Markov) SIR model (which includes the general epidemic and generalized Reed-Frost models) exhibits positive correlation. The results for SIS models rely heavily on monotonicity properties and in some cases on a careful choice of partial order. For the SIR models a graphical construction of the models is used.
Consideration is given to the stochastic problem of the coagulation of particles for the case of a size-independent coagulation kernel, and expressions are derived for the expectation value, variance and covariance of the cluster size distribution function, for both a discrete and a continuous spectrum of cluster sizes. The authors develop an asymptotic expansion in V -1 of these quantities (where V is the spatial volume), showing that as V to infinity the above expectation value tends to the deterministic result, and obtaining an explicit form for the first-order deviation from this expression for large (but finite) V. Analogous results are derived for the variance and covariance in the limit of large V. A discussion is given of the extent to which stochastic effects can produce significant changes to the deterministic results.
JOURNAL OF THE LONDON MATHEMATICAL SOCIETY-SECOND SERIES, 47 (3), pp. 395-404. | Citations: 24 (Web of Science Lite)
1993. ON THE ASYMPTOTIC-DISTRIBUTION OF LARGE PRIME FACTORS
JOURNAL OF APPLIED PROBABILITY, 30 (2), pp. 275-284. | Citations: 4 (Web of Science Lite)
1993. CORRELATION AND VARIABILITY IN BIRTH PROCESSES
ADVANCES IN APPLIED PROBABILITY, 25 (1), pp. 255-260. | Citations: 10 (Web of Science Lite)
1993. ON CONDITIONAL INTENSITIES AND ON INTERPARTICLE CORRELATION IN NONLINEAR DEATH PROCESSES
Analysis of several large collections of food webs has shown that predator-prey ratios tend to be roughly constant at values close to one. The constancy of the ratio may simply be an arithmetical artifact, a consequence of the way the ratio is defined. Taxa can be recorded as both predator and prey and, hence, be double counted. In many webs the proportion of species that are double counted is large; consequently, the ratio of predator species to prey species will inevitably be roughly equal to one. -from Authors
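The arithmetic behind this argument can be sketched as follows. The split of taxa into basal (prey only), intermediate (both predator and prey) and top (predator only) groups, and the numbers used, are assumptions made for illustration and do not come from the paper's data.

```python
def predator_prey_ratio(basal, intermediate, top):
    """Ratio of predator taxa to prey taxa when intermediate taxa are counted in both groups."""
    predators = intermediate + top   # every taxon that eats something
    prey = basal + intermediate      # every taxon that is eaten by something
    return predators / prey


# Hypothetical web of 100 taxa in which 80 are double counted
mostly_intermediate = predator_prey_ratio(basal=10, intermediate=80, top=10)

# Hypothetical web with no double counting at all
no_overlap = predator_prey_ratio(basal=30, intermediate=0, top=10)
```

When the intermediate group dominates, the same large count appears in both numerator and denominator, so the ratio is pulled toward one whatever the basal and top counts are.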
Consider a random sample of genes at a locus, drawn from a population evolving according to the infinitely many, neutral, alleles model. The sample will have a most recent common ancestor gene, which we shall call 'Eve'. The probability distribution, for the number of genes of oldest allelic type in a sample, is known and has a neat form. Rather less is known about the distribution for the number of genes in the sample which are of the same allelic type as Eve possessed. If the latter number is positive, then these genes are automatically of the oldest type in the sample. But Eve may have no non-mutant descendants in the sample; then, the oldest allele will be a mutant arising in a line of descent after Eve. The paper studies the number of non-mutant descendants from Eve, its distribution and moments. It seems that there may be few neat results. In large samples, the proportion of genes of Eve's type has an approximate beta-like density, together with a discrete probability atom at zero, if the mutation rate parameter is low. Extinction of the allele of even the population's common ancestor is possible, but not certain, and bounds are obtained for its probability. Some comments are made about the applications and implications of the results for human mitochondrial DNA.
ANNALS OF PROBABILITY, 20 (1), pp. 322-341. | Citations: 2 (Web of Science Lite)
1992. WEAK-CONVERGENCE OF POPULATION GENEALOGICAL PROCESSES TO THE COALESCENT WITH AGES
ANNALS OF PROBABILITY, 19 (3), pp. 1102-1117. | Citations: 13 (Web of Science Lite)
1991. WEAK-CONVERGENCE TO A MARKOV-CHAIN WITH AN ENTRANCE BOUNDARY - ANCESTRAL PROCESSES IN POPULATION-GENETICS
ADVANCES IN APPLIED PROBABILITY, 23 (2), pp. 229-258. | Citations: 24 (Web of Science Lite)
1991. CONSISTENT ORDERED SAMPLING DISTRIBUTIONS - CHARACTERIZATION AND CONVERGENCE
JOURNAL OF APPLIED PROBABILITY, 28 (2), pp. 321-335. | Citations: 12 (Web of Science Lite)
1991. THE HEAPS PROCESS, LIBRARIES, AND SIZE-BIASED PERMUTATIONS
As part of our effort to construct a physical map of the genome of Arabidopsis thaliana we have made a mathematical analysis of our experimental approach of anchoring yeast artificial chromosome clones with genetically mapped RFLPs and RAPDs. The details of this analysis are presented and their implications for mapping the Arabidopsis genome are discussed.
JOURNAL OF APPLIED PROBABILITY, 26 (3), pp. 477-489. | Citations: 1 (Web of Science Lite)
1989. ON REINFORCEMENT-DEPLETION COMPARTMENTAL URN MODELS
Ranked and size-biased permutations are particular functions on the set of probability measures on the simplex. They represent two recently studied schemes for relabelling groups in certain stochastic models, and are of particular interest in describing the limiting behaviour of such models. We prove that the ranked permutations of a sequence of measures converge if and only if the size-biased permutations converge, and give conditions under which weak convergence of measures guarantees weak convergence of both permutations. Applications include a proof of the fact that the GEM distribution is the size biased permutation of the Poisson-Dirichlet and a new proof of the fact that when labelled in a particular way, normalized cycle lengths in a random permutation converge to the GEM distribution. These techniques also allow some problems concerned with the random splitting of an interval to be related to known results in other fields. © 1989.
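For a finite vector of positive weights, the two relabelling schemes can be sketched as below. This is only a finite illustration under that assumption; the paper itself works with probability measures on the simplex, and the function names here are hypothetical.

```python
import random


def ranked_permutation(weights):
    """Relabel groups in decreasing order of size."""
    return sorted(weights, reverse=True)


def size_biased_permutation(weights, rng=None):
    """Relabel groups by repeatedly picking a remaining group with probability proportional to its size."""
    rng = rng or random.Random()
    remaining = list(weights)  # assumed strictly positive
    ordered = []
    while remaining:
        # Choose an index with probability proportional to the remaining weights
        i = rng.choices(range(len(remaining)), weights=remaining)[0]
        ordered.append(remaining.pop(i))
    return ordered


p = [0.5, 0.3, 0.15, 0.05]
ranked = ranked_permutation(p)
biased = size_biased_permutation(p, rng=random.Random(0))
```

Both functions return a rearrangement of the same weights; the ranked version is deterministic, while the size-biased version is random and tends to place larger groups first.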
We adapt the Moran model for neutral reproduction to allow for correlations in offspring numbers between successive generations. Such correlations (perhaps caused by linkage disequilibrium with a non-neutral locus, or a varying environment) rather than the action of natural selection might account for departures from the neutral-theory distribution of allele frequencies in a sample. The conclusion, however, is that while the number of alleles present will tend to be smaller, the conditional distribution of allele frequencies remains unchanged. There is some evidence that this conclusion might remain valid for more general models. © 1989.
Variability in compartmental models is shown to be related to interparticle correlations. This permits a unified study of a wide variety of models. In the single-compartment case explicit expressions are given for this interparticle correlation, and hence for the variability of the model, while in the multicompartment setting increased variability of many models is shown to be a consequence of exchangeability and de Finetti's theorem.
ADVANCES IN APPLIED PROBABILITY, 19 (4), pp. 755-766. | Citations: 13 (Web of Science Lite)
1987. INTERPARTICLE CORRELATION IN DEATH PROCESSES WITH APPLICATION TO VARIABILITY IN COMPARTMENTAL-MODELS
A process analogous to Kingman's coalescent is introduced to describe the genealogy of populations evolving according to the infinitely-many neutral alleles model. The process records population frequencies in old and new classes, and labels the new classes in order of decreasing age. Its marginal distribution is characterized in a form which is amenable to explicit calculations and the transition densities of the associated K-allele models follow readily from this representation.
Various techniques of cluster analysis are applied to 125 civil parishes in southern Hampshire and 55 civil parishes in southern Staffordshire. Despite the differing infrastructure, the analysis identifies within both a similar spectrum of settlements defined by size, tenancy, and age structure: small Established villages with a high proportion of tied and rented property, a high percentage of village workers and typically an old age profile; large Metropolitan villages which are dominated by owner occupied property, and possessing a younger age profile and a high proportion of commuters; and between these two extremes, Uniform villages, which are average in size with a tenure system balanced between tied and privately rented, local authority, and owner occupied property, some commuters and a balanced or slightly older than average age profile. Concludes that owing to the current planning policy of both these counties, such a spectrum is likely to remain relatively stable. -from Authors
It has recently been shown that the Ewens sampling formula may be generated by a Polya-like urn model. A genealogical proof of this result equates the labelling of balls in the urn to the partition by age of alleles in the sample. This urn construction is shown to be equivalent to the construction of Kingman (Proc. Roy. Soc. London Ser. A 361 (1978), 1-20) using a Poisson-Dirichlet "paintbox" and as a consequence, the partition by ages is seen to be equivalent to the size biased permutation of the Poisson-Dirichlet distribution. This approach unifies and extends many results on ages of alleles, the Polya urn, and the Poisson-Dirichlet distribution. Furthermore the Ewens sampling formula is characterized as being the only partition structure which may be generated by an urn-like mechanism.
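A minimal sketch of a Polya-like urn of this kind (often called Hoppe's urn) is given below. The parameter name `theta` and the function name are assumptions chosen for illustration; the sketch simulates one partition and records block sizes in order of appearance, which corresponds to ordering the blocks by age, oldest first.

```python
import random


def hoppe_urn(n, theta, rng=None):
    """Simulate n draws; return block sizes in order of first appearance (oldest first)."""
    rng = rng or random.Random()
    sizes = []  # sizes[i] = number of balls of the i-th colour to appear
    for k in range(n):  # k coloured balls are already in the urn
        # The black ball (mass theta) is drawn with probability theta / (theta + k)
        if rng.random() < theta / (theta + k):
            sizes.append(1)  # start a new colour
        else:
            # Pick one of the k existing balls uniformly and copy its colour
            r = rng.randrange(k)
            acc = 0
            for i, s in enumerate(sizes):
                acc += s
                if r < acc:
                    sizes[i] += 1
                    break
    return sizes


blocks = hoppe_urn(n=50, theta=2.0, rng=random.Random(1))
```

Forgetting the order of `blocks` gives a random partition of `n`; under this construction its distribution is the Ewens sampling formula with parameter `theta`.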
JOURNAL OF APPLIED PROBABILITY, 23 (2), pp. 283-296. | Citations: 14 (Web of Science Lite)
1986. A GENEALOGICAL APPROACH TO VARIABLE-POPULATION-SIZE MODELS IN POPULATION-GENETICS
ADVANCES IN APPLIED PROBABILITY, 18 (1), pp. 1-19. | Citations: 159 (Web of Science Lite)
1986. THE AGES OF ALLELES AND A COALESCENT
LECTURE NOTES IN MATHEMATICS, 1212 pp. 94-105. | Citations: 6 (Web of Science Lite)
1986. DUAL PROCESSES IN POPULATION-GENETICS
By means of a representation as interactive particle systems, dual processes are constructed for a large class of exchangeable models in population genetics. It is shown that as the population size becomes large these dual processes tend in distribution to a particularly tractable limiting dual process. Properties of the models are analyzed using the duality relationship and approximate expressions are obtained for various quantities. Diffusion approximations follow easily from the invariance result.
The Wright-Fisher model is considered in the case where the population size is random and the magnitude of the selective advantage of one of the alleles varies with time. The central question addressed is the possibility of ultimate genetic polymorphism. Partial results are obtained in the general case and complete results in the case where the population size and selective advantage are not density dependent. Bounds on the fixation probability are obtained when the selective advantage is constant. © 1985 Springer-Verlag.
MATHEMATICAL PROCEEDINGS OF THE CAMBRIDGE PHILOSOPHICAL SOCIETY, 95 (MAR), pp. 349-358. | Citations: 36 (Web of Science Lite)
1984. THE TRANSIENT-BEHAVIOR OF THE MORAN MODEL IN POPULATION-GENETICS
MATHEMATICAL PROCEEDINGS OF THE CAMBRIDGE PHILOSOPHICAL SOCIETY, 94 (JUL), pp. 167-182. | Citations: 28 (Web of Science Lite)
1983. FINITE PARTICLE-SYSTEMS AND INFECTION MODELS
Total publications on this page: 208
Total citations for publications on this page: 77151