# Professor Christopher Yau

Research Area: Bioinformatics & Stats (inc. Modelling and Computational Biology); Bioinformatics and Statistical Genetics; Genetics & Genomics and Cancer Biology; Statistics, Computational Biology, Cancer and Bioinformatics

The focus of my group is the development of computational statistical methods for applications in genetics and genomics. The main areas of work include:

• Translational Computational Oncogenomics. The development of cutting-edge computational statistical methods and tools that can be widely used by specialists and non-specialists alike for research and clinical practice in cancer.
• Single Cell Informatics. Developing novel statistical techniques for single cell genomics.
• Data-driven Statistics. Developing novel statistical ideas and generic techniques that are inspired by real data analysis problems in genetics and genomics. Developing computational techniques formulated on sound statistical principles for the analysis of very large datasets commonly found in genomics (Big Data) and learning biologically relevant features (Deep or Representation Learning).

| Name | Department | Institution | Country |
|---|---|---|---|
| Professor Chris Holmes | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom |
| Professor Ian Tomlinson | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom |
| Dr Jean-Baptiste Cazier | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom |
| Prof Ahmed Ashour Ahmed | Weatherall Institute of Molecular Medicine | Oxford University, Weatherall Institute of Molecular Medicine | United Kingdom |
| Dr Samantha JL Knight | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom |
| Professor Mark McCarthy | | Oxford University, Oxford Centre for Diabetes, Endocrinology & Metabolism | United Kingdom |
| Prof Anna L Gloyn | OCDEM | Oxford University, Oxford Centre for Diabetes, Endocrinology & Metabolism | United Kingdom |
| Prof Adam Mead MRCP FRCPath | Weatherall Institute of Molecular Medicine | Oxford University, Weatherall Institute of Molecular Medicine | United Kingdom |
| Dr Caleb Webber | Department of Physiology, Anatomy and Genetics | University of Oxford | United Kingdom |
| Dr Michalis Titsias | Department of Informatics | Athens University of Economics and Business | Greece |
| Dr Richard Savage | Systems Biology Centre | University of Warwick | United Kingdom |

2018. Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data. Nat Commun, 9 (1), pp. 2442.

Pseudotime algorithms can be employed to extract latent temporal information from cross-sectional data sets allowing dynamic biological processes to be studied in situations where the collection of time series data is challenging or prohibitive. Computational techniques have arisen from single-cell 'omics and cancer modelling where pseudotime can be used to learn about cellular differentiation or tumour progression. However, methods to date typically implicitly assume homogeneous genetic, phenotypic or environmental backgrounds, which becomes limiting as data sets grow in size and complexity. We describe a novel statistical framework that learns how pseudotime trajectories can be modulated through covariates that encode such factors. We apply this model to both single-cell and bulk gene expression data sets and show that the approach can recover known and novel covariate-pseudotime interaction effects. This hybrid regression-latent variable model framework extends pseudotemporal modelling from its most prevalent area of single cell genomics to wider applications.

et al. 2018. Tuning microtubule dynamics to enhance cancer therapy by modulating FER-mediated CRMP2 phosphorylation. Nat Commun, 9 (1), pp. 476.

Though used widely in cancer therapy, paclitaxel only elicits a response in a fraction of patients. A strong determinant of paclitaxel tumor response is the state of microtubule dynamic instability. However, whether the manipulation of this physiological process can be controlled to enhance paclitaxel response has not been tested. Here, we show a previously unrecognized role of the microtubule-associated protein CRMP2 in inducing microtubule bundling through its carboxy terminus. This activity is significantly decreased when the FER tyrosine kinase phosphorylates CRMP2 at Y479 and Y499. The crystal structures of wild-type CRMP2 and CRMP2-Y479E reveal how mimicking phosphorylation prevents tetramerization of CRMP2. Depletion of FER or reducing its catalytic activity using sub-therapeutic doses of inhibitors increases paclitaxel-induced microtubule stability and cytotoxicity in ovarian cancer cells and in vivo. This work provides a rationale for inhibiting FER-mediated CRMP2 phosphorylation to enhance paclitaxel on-target activity for cancer therapy.

2017. Separation of Dual Oxidase 2 and Lactoperoxidase Expression in Intestinal Crypts and Species Differences May Limit Hydrogen Peroxide Scavenging During Mucosal Healing in Mice and Humans. Inflamm Bowel Dis, 24 (1), pp. 136-148.

Background: DUOX2 and DUOXA2 form the predominant H2O2-producing system in human colorectal mucosa. Inflammation, hypoxia, and 5-aminosalicylic acid increase H2O2 production, supporting innate defense and mucosal healing. Thiocyanate reacts with H2O2 in the presence of lactoperoxidase (LPO) to form hypothiocyanite (OSCN-), which acts as a biocide and H2O2 scavenging system to reduce damage during inflammation. We aimed to discover the organization of Duox2, Duoxa2, and Lpo expression in colonic crypts of Lieberkühn (intestinal glands) of mice and how distributions respond to dextran sodium sulfate (DSS)-induced colitis and subsequent mucosal regeneration. Methods: We studied tissue from DSS-exposed mice and human biopsies using in situ hybridization, reverse transcription quantitative polymerase chain reaction, and cDNA microarray analysis. Results: Duox2 mRNA expression was mostly in the upper crypt quintile while Duoxa2 was more apically focused. Most Lpo mRNA was in the basal quintile, where stem cells reside. Duox2 and Duoxa2 mRNA were increased during the induction and resolution of DSS colitis, while Lpo expression did not increase during the acute phase. Patterns of Lpo expression differed from Duox2 in normal, inflamed, and regenerative mouse crypts (P < 0.001). We found no evidence of LPO expression in the human gut. Conclusions: The spatial and temporal separation of H2O2-consuming and -producing enzymes enables a thiocyanate-H2O2 "scavenging" system in murine intestinal crypts to protect the stem/proliferative zones from DNA damage, while still supporting higher H2O2 concentrations apically to aid mucosal healing. The absence of LPO expression in the human gut suggests an alternative mechanism or less protection from DNA damage during H2O2-driven mucosal healing.

et al. 2018. NOX1 loss-of-function genetic variants in patients with inflammatory bowel disease. Mucosal Immunol, 11 (2), pp. 562-574.

Genetic defects that affect intestinal epithelial barrier function can present with very early-onset inflammatory bowel disease (VEOIBD). Using whole-genome sequencing, a novel hemizygous defect in NOX1 encoding NADPH oxidase 1 was identified in a patient with ulcerative colitis-like VEOIBD. Exome screening of 1,878 pediatric patients identified a further seven male inflammatory bowel disease (IBD) patients with rare NOX1 mutations. Loss-of-function was validated in p.N122H and p.T497A, and to a lesser degree in p.Y470H, p.R287Q, p.I67M, p.Q293R as well as the previously described p.P330S, and the common NOX1 SNP p.D360N (rs34688635) variant. The missense mutation p.N122H abrogated reactive oxygen species (ROS) production in cell lines, ex vivo colonic explants, and patient-derived colonic organoid cultures. Within colonic crypts, NOX1 constitutively generates a high level of ROS in the crypt lumen. Analysis of 9,513 controls and 11,140 IBD patients of non-Jewish European ancestry did not reveal an association between p.D360N and IBD. Our data suggest that loss-of-function variants in NOX1 do not cause a Mendelian disorder of high penetrance but are a context-specific modifier. Our results indicate that variants in NOX1 change brush border ROS within colonic crypts at the interface between the epithelium and luminal microbes.

2017. The Hamming Ball Sampler. J Am Stat Assoc, 112 (520), pp. 1598-1611.

We introduce the Hamming ball sampler, a novel Markov chain Monte Carlo algorithm, for efficient inference in statistical models involving high-dimensional discrete state spaces. The sampling scheme uses an auxiliary variable construction that adaptively truncates the model space allowing iterative exploration of the full model space. The approach generalizes conventional Gibbs sampling schemes for discrete spaces and provides an intuitive means for user-controlled balance between statistical efficiency and computational tractability. We illustrate the generic utility of our sampling algorithm through application to a range of statistical models. Supplementary materials for this article are available online.

2017. A pan-cancer genome-wide analysis reveals tumour dependencies by induction of nonsense-mediated decay. Nat Commun, 8 pp. 15943.

Nonsense-mediated decay (NMD) eliminates transcripts with premature termination codons. Although NMD-induced loss-of-function has been shown to contribute to the genesis of particular cancers, its global functional consequence in tumours has not been characterized. Here we develop an algorithm to predict NMD and apply it to somatic mutations reported in The Cancer Genome Atlas. We identify more than 73 K mutations that are predicted to elicit NMD (NMD-elicit). NMD-elicit mutations in tumour suppressor genes (TSGs) are associated with significant reduction in gene expression. We discover cancer-specific NMD-elicit signatures in TSGs and cancer-associated genes. Our analysis reveals a previously unrecognized dependence of hypermutated tumours on hypofunction of genes that are involved in chromatin remodelling and translation. Half of hypermutated stomach adenocarcinomas are associated with NMD-elicit mutations of the translation initiators LARP4B and EIF5B. Our results unravel strong therapeutic opportunities by targeting tumour dependencies on NMD-elicit mutations.

et al. 2017. Colorectal Cancer Cell Line Proteomes Are Representative of Primary Tumors and Predict Drug Sensitivity. Gastroenterology, 153 (4), pp. 1082-1095.

BACKGROUND AND AIMS: Proteomics holds promise for individualizing cancer treatment. We analyzed to what extent the proteomic landscape of human colorectal cancer (CRC) is maintained in established CRC cell lines and the utility of proteomics for predicting therapeutic responses. METHODS: Proteomic and transcriptomic analyses were performed on 44 CRC cell lines, compared against primary CRCs (n=95) and normal tissues (n=60), and integrated with genomic and drug sensitivity data. RESULTS: Cell lines mirrored the proteomic aberrations of primary tumors, in particular for intrinsic programs. Tumor relationships of protein expression with DNA copy number aberrations and signatures of post-transcriptional regulation were recapitulated in cell lines. The 5 proteomic subtypes previously identified in tumors were represented among cell lines. Nonetheless, systematic differences between cell line and tumor proteomes were apparent, attributable to stroma, extrinsic signaling, and growth conditions. Contribution of tumor stroma obscured signatures of DNA mismatch repair identified in cell lines with a hypermutation phenotype. Global proteomic data showed improved utility for predicting both known drug-target relationships and overall drug sensitivity as compared with genomic or transcriptomic measurements. Inhibition of targetable proteins associated with drug responses further identified corresponding synergistic or antagonistic drug combinations. Our data provide evidence for CRC proteomic subtype-specific drug responses. CONCLUSIONS: Proteomes of established CRC cell lines are representative of primary tumors. Proteomic data tend to exhibit improved prediction of drug sensitivity as compared with genomic and transcriptomic profiles. Our integrative proteogenomic analysis highlights the potential of proteome profiling to inform personalized cancer medicine.

2017. switchde: inference of switch-like differential expression along single-cell trajectories. Bioinformatics, 33 (8), pp. 1241-1242.

Motivation: Pseudotime analyses of single-cell RNA-seq data have become increasingly common. Typically, a latent trajectory corresponding to a biological process of interest, such as differentiation or cell cycle, is discovered. However, relatively little attention has been paid to modelling the differential expression of genes along such trajectories. Results: We present switchde, a statistical framework and accompanying R package for identifying switch-like differential expression of genes along pseudotemporal trajectories. Our method includes fast model fitting that provides interpretable parameter estimates corresponding to how quickly a gene is up or down regulated as well as where in the trajectory such regulation occurs. It also reports a P-value in favour of rejecting a constant-expression model for switch-like differential expression and optionally models the zero-inflation prevalent in single-cell data. Availability and Implementation: The R package switchde is available through the Bioconductor project at https://bioconductor.org/packages/switchde. Contact: kieran.campbell@sjc.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
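As a rough illustration of what "switch-like" expression along pseudotime can look like, the sketch below uses a sigmoid curve with one parameter for how quickly expression changes and one for where in the trajectory the change occurs. The functional form, the function name `switch_mean` and the parameter names `mu0`, `k` and `t0` are assumptions made for illustration; they are not taken from the abstract and this is not the switchde package API.

```python
import math


def switch_mean(t, mu0, k, t0):
    """Illustrative sigmoidal mean expression at pseudotime t.

    Assumed (hypothetical) parameterisation:
      mu0 -- expression level at the switch point (half the plateau)
      k   -- switch strength; k > 0 means up-regulation, k < 0 down-regulation
      t0  -- pseudotime at which the switch occurs
    """
    return 2.0 * mu0 / (1.0 + math.exp(-k * (t - t0)))


if __name__ == "__main__":
    # Evaluate an up-regulated gene on a small pseudotime grid.
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(t, round(switch_mean(t, mu0=3.0, k=10.0, t0=0.5), 3))
```

With this form, a larger absolute value of `k` gives a sharper switch, and `k = 0` reduces to a constant expression level of `mu0`.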

et al. 2017. Germline Mutations in PALB2, BRCA1, and RAD51C, Which Regulate DNA Recombination Repair, in Patients With Gastric Cancer. Gastroenterology, 152 (5), pp. 983-986.e6.

Up to 10% of cases of gastric cancer are familial, but so far, only mutations in CDH1 have been associated with gastric cancer risk. To identify genetic variants that affect risk for gastric cancer, we collected blood samples from 28 patients with hereditary diffuse gastric cancer (HDGC) not associated with mutations in CDH1 and performed whole-exome sequence analysis. We then analyzed sequences of candidate genes in 333 independent HDGC and non-HDGC cases. We identified 11 cases with mutations in PALB2, BRCA1, or RAD51C genes, which regulate homologous DNA recombination. We found these mutations in 2 of 31 patients with HDGC (6.5%) and 9 of 331 patients with sporadic gastric cancer (2.8%). Most of these mutations had been previously associated with other types of tumors and partially co-segregated with gastric cancer in our study. Tumors that developed in patients with these mutations had a mutation signature associated with somatic homologous recombination deficiency. Our findings indicate that defects in homologous recombination increase risk for gastric cancer.

2017. Probabilistic modeling of bifurcations in single-cell gene expression data using a Bayesian mixture of factor analyzers. Wellcome Open Res, 2 pp. 19.

Modeling bifurcations in single-cell transcriptomics data has become an increasingly popular field of research. Several methods have been proposed to infer bifurcation structure from such data, but all rely on heuristic non-probabilistic inference. Here we propose the first generative, fully probabilistic model for such inference based on a Bayesian hierarchical mixture of factor analyzers. Our model exhibits competitive performance on large datasets despite implementing full Markov-Chain Monte Carlo sampling, and its unique hierarchical prior structure enables automatic determination of genes driving the bifurcation process. We additionally propose an Empirical-Bayes like extension that deals with the high levels of zero-inflation in single-cell RNA-seq data and quantify when such models are useful. We apply our model to both real and simulated single-cell gene expression data and compare the results to existing pseudotime methods. Finally, we discuss both the merits and weaknesses of such a unified, probabilistic approach in the context of practical bioinformatics analyses.

2016. Order Under Uncertainty: Robust Differential Expression Analysis Using Probabilistic Models for Pseudotime Inference. PLoS Comput Biol, 12 (11), pp. e1005212.

Single cell gene expression profiling can be used to quantify transcriptional dynamics in temporal processes, such as cell differentiation, using computational methods to label each cell with a 'pseudotime' where true time series experimentation is too difficult to perform. However, owing to the high variability in gene expression between individual cells, there is an inherent uncertainty in the precise temporal ordering of the cells. Pre-existing methods for pseudotime estimation have predominantly given point estimates precluding a rigorous analysis of the implications of uncertainty. We use probabilistic modelling techniques to quantify pseudotime uncertainty and propagate this into downstream differential expression analysis. We demonstrate that reliance on a point estimate of pseudotime can lead to inflated false discovery rates and that probabilistic approaches provide greater robustness and measures of the temporal resolution that can be obtained from pseudotime inference.

Cited: 20 (Scopus)

et al. 2016. Salt-Inducible Kinase 2 Couples Ovarian Cancer Cell Metabolism with Survival at the Adipocyte-Rich Metastatic Niche. Cancer Cell, 30 (2), pp. 273-289.

The adipocyte-rich microenvironment forms a niche for ovarian cancer metastasis, but the mechanisms driving this process are incompletely understood. Here we show that salt-inducible kinase 2 (SIK2) is overexpressed in adipocyte-rich metastatic deposits compared with ovarian primary lesions. Overexpression of SIK2 in ovarian cancer cells promotes abdominal metastasis while SIK2 depletion prevents metastasis in vivo. Importantly, adipocytes induce calcium-dependent activation and autophosphorylation of SIK2. Activated SIK2 plays a dual role in augmenting AMPK-induced phosphorylation of acetyl-CoA carboxylase and in activating the PI3K/AKT pathway through p85α-S154 phosphorylation. These findings identify SIK2 at the apex of the adipocyte-induced signaling cascades in cancer cells and make a compelling case for targeting SIK2 for therapy in ovarian cancer.

et al. 2016. Premalignant SOX2 overexpression in the fallopian tubes of ovarian cancer patients: Discovery and validation studies. EBioMedicine, 10 pp. 137-149.

Current screening methods for ovarian cancer can only detect advanced disease. Earlier detection has proved difficult because the molecular precursors involved in the natural history of the disease are unknown. To identify early driver mutations in ovarian cancer cells, we used dense whole genome sequencing of micrometastases and microscopic residual disease collected at three time points over three years from a single patient during treatment for high-grade serous ovarian cancer (HGSOC). The functional and clinical significance of the identified mutations was examined using a combination of population-based whole genome sequencing, targeted deep sequencing, multi-center analysis of protein expression, loss of function experiments in an in-vivo reporter assay and mammalian models, and gain of function experiments in primary cultured fallopian tube epithelial (FTE) cells. We identified frequent mutations involving a 40kb distal repressor region for the key stem cell differentiation gene SOX2. In the apparently normal FTE, the region was also mutated. This was associated with a profound increase in SOX2 expression (p<2(-16)), which was not found in patients without cancer (n=108). Importantly, we show that SOX2 overexpression in FTE is nearly ubiquitous in patients with HGSOCs (n=100), and common in BRCA1-BRCA2 mutation carriers (n=71) who underwent prophylactic salpingo-oophorectomy. We propose that the finding of SOX2 overexpression in FTE could be exploited to develop biomarkers for detecting disease at a premalignant stage, which would reduce mortality from this devastating disease.

2016. A Method to Exploit the Structure of Genetic Ancestry Space to Enhance Case-Control Studies. Am J Hum Genet, 98 (5), pp. 857-868.

One goal of human genetics is to understand the genetic basis of disease, a challenge for diseases of complex inheritance because risk alleles are few relative to the vast set of benign variants. Risk variants are often sought by association studies in which allele frequencies in case subjects are contrasted with those from population-based samples used as control subjects. In an ideal world we would know population-level allele frequencies, releasing researchers to focus on case subjects. We argue this ideal is possible, at least theoretically, and we outline a path to achieving it in reality. If such a resource were to exist, it would yield ample savings and would facilitate the effective use of data repositories by removing administrative and technical barriers. We call this concept the Universal Control Repository Network (UNICORN), a means to perform association analyses without necessitating direct access to individual-level control data. Our approach to UNICORN uses existing genetic resources and various statistical tools to analyze these data, including hierarchical clustering with spectral analysis of ancestry; and empirical Bayesian analysis along with Gaussian spatial processes to estimate ancestry-specific allele frequencies. We demonstrate our approach using tens of thousands of control subjects from studies of Crohn disease, showing how it controls false positives, provides power similar to that achieved when all control data are directly accessible, and enhances power when control data are limiting or even imperfectly matched ancestrally. These results highlight how UNICORN can enable reliable, powerful, and convenient genetic association analyses without access to the individual-level data.

2016. Statistical Inference in Hidden Markov Models Using k-Segment Constraints. J Am Stat Assoc, 111 (513), pp. 200-215.

Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward-backward algorithm. In this article, we expand the amount of information we could obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online.
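To make the idea of a segment constraint concrete, here is a minimal sketch of one simple way to find a most-probable hidden path with exactly k segments: a standard Viterbi pass whose state is augmented with a segment counter. This is an illustration under stated assumptions, not the recursions from the article. A segment is taken to be a maximal run of identical hidden states, and the function name `k_segment_viterbi` and its argument layout are hypothetical.

```python
import math


def k_segment_viterbi(log_init, log_trans, log_emit, k):
    """Most-probable hidden path with exactly k segments (illustrative sketch).

    log_init[s]     -- log P(first state = s)
    log_trans[r][s] -- log P(next state = s | current state = r)
    log_emit[t][s]  -- log P(observation at time t | state = s)
    Returns (path, log_probability). Cost is O(T * k * S^2), linear in T.
    """
    T, S = len(log_emit), len(log_init)
    if k < 1 or T < 1:
        raise ValueError("need k >= 1 and at least one observation")
    neg_inf = -math.inf
    # score[j][s]: best log-probability of a path ending in state s
    # that has used j + 1 segments so far.
    score = [[neg_inf] * S for _ in range(k)]
    for s in range(S):
        score[0][s] = log_init[s] + log_emit[0][s]
    back = []  # back[t - 1][j][s] = (previous j, previous state)
    for t in range(1, T):
        new = [[neg_inf] * S for _ in range(k)]
        ptr = [[None] * S for _ in range(k)]
        for j in range(k):
            for s in range(S):
                # Staying in the same state keeps the segment count.
                best, arg = score[j][s] + log_trans[s][s], (j, s)
                if j > 0:
                    # Switching state opens a new segment.
                    for r in range(S):
                        if r != s:
                            cand = score[j - 1][r] + log_trans[r][s]
                            if cand > best:
                                best, arg = cand, (j - 1, r)
                new[j][s] = best + log_emit[t][s]
                ptr[j][s] = arg
        score = new
        back.append(ptr)
    # Pick the best final state among paths that used exactly k segments.
    s = max(range(S), key=lambda q: score[k - 1][q])
    best_score = score[k - 1][s]
    if best_score == neg_inf:
        raise ValueError("no path with exactly k segments has positive probability")
    j, path = k - 1, [s]
    for ptr in reversed(back):
        j, s = ptr[j][s]
        path.append(s)
    path.reverse()
    return path, best_score
```

Running the function for several values of k on the same data gives a quick view of how the best path changes as more segments are allowed, which is the kind of exploration of an existing model fit that the abstract describes.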

Cited: 23 (Scopus)

2016. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics, 17 (1), pp. 140.

BACKGROUND: Advances in single cell genomics provide a way of routinely generating transcriptomics data at the single cell level. A frequent requirement of single cell expression analysis is the identification of novel patterns of heterogeneity across single cells that might explain complex cellular states or tissue composition. To date, classical statistical analysis tools have been routinely applied, but there is considerable scope for the development of novel statistical approaches that are better adapted to the challenges of inferring cellular hierarchies. RESULTS: We have developed a novel agglomerative clustering method that we call pcaReduce to generate a cell state hierarchy where each cluster branch is associated with a principal component of variation that can be used to differentiate two cell states. Using two real single cell datasets, we compared our approach to other commonly used statistical techniques, such as K-means and hierarchical clustering. We found that pcaReduce was able to give more consistent clustering structures when compared to broad and detailed cell type labels. CONCLUSIONS: Our novel integration of principal components analysis and hierarchical clustering establishes a connection between the representation of the expression data and the number of cell types that can be discovered. In doing so we found that pcaReduce performs better than either technique in isolation in terms of characterising putative cell states. Our methodology is complementary to other single cell clustering techniques and adds to a growing palette of single cell bioinformatics tools for profiling heterogeneous cell populations.

Cited: 81 (Scopus)

2015. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol, 16 (1), pp. 241.

Single-cell RNA-seq data allow insight into normal cellular function and various disease states through molecular characterization of gene expression on the single cell level. Dimensionality reduction of such high-dimensional data sets is essential for visualization and analysis, but single-cell RNA-seq data are challenging for classical dimensionality-reduction methods because of the prevalence of dropout events, which lead to zero-inflated data. Here, we develop a dimensionality-reduction method, (Z)ero (I)nflated (F)actor (A)nalysis (ZIFA), which explicitly models the dropout characteristics, and show that it improves modeling accuracy on simulated and biological data sets.

2015. The Alive Particle Filter and Its Use in Particle Markov Chain Monte Carlo. Stochastic Analysis and Applications, 33 (6), pp. 943-974.

In the following article, we investigate a particle filter for approximating Feynman–Kac models with indicator potentials and we use this algorithm within Markov chain Monte Carlo (MCMC) to learn static parameters of the model. Examples of such models include approximate Bayesian computation (ABC) posteriors associated with hidden Markov models (HMMs) or rare-event problems. Such models require the use of advanced particle filter or MCMC algorithms to perform estimation. One of the drawbacks of existing particle filters is that they may “collapse,” in that the algorithm may terminate early, due to the indicator potentials. In this article, using a newly developed special case of the locally adaptive particle filter, we use an algorithm that can deal with this latter problem, while introducing a random cost per-time step. In particular, we show how this algorithm can be used within MCMC, using particle MCMC. It is established that, when not taking into account computational time, when the new MCMC algorithm is applied to a simplified model it has a lower asymptotic variance in comparison to a standard particle MCMC algorithm. Numerical examples are presented for ABC approximations of HMMs.

Cited: 99 (Scopus)

et al. 2015. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat Genet, 47 (7), pp. 717-726.

To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.

et al. 2014. The Identification of Further Minimal Regions of Overlap in Chronic Lymphocytic Leukemia Using High-Resolution SNP Arrays. Blood, 124 (21).

2014. A Sequential Algorithm for Fast Fitting of Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 23 (4), pp. 1143-1162.

In this article we propose an improvement on the sequential updating and greedy search (SUGS) algorithm of Wang and Dunson for fast fitting of Dirichlet process mixture models. The SUGS algorithm provides a means for very fast approximate Bayesian inference for mixture data, which is particularly of use when data sets are so large that many standard Markov chain Monte Carlo (MCMC) algorithms cannot be applied efficiently, or take a prohibitively long time to converge. In particular, these ideas are used to initially interrogate the data, and to refine models such that one can potentially apply exact data analysis later on. SUGS relies upon sequentially allocating data to clusters and proceeding with an update of the posterior on the subsequent allocations and parameters which assumes this allocation is correct. Our modification softens this approach by providing a probability distribution over allocations, with a similar computational cost; this approach has an interpretation as a variational Bayes procedure and hence we term it variational SUGS (VSUGS). It is shown in simulated examples that VSUGS can out-perform, in terms of density estimation and classification, the original SUGS algorithm in many scenarios. In addition, we present a data analysis for flow cytometry data, and SNP data via a three-class Dirichlet process mixture model, illustrating the apparent improvement over SUGS.

2014. Erythrocytosis associated with a novel missense mutation in the BPGM gene. Haematologica, 99 (10), pp. e201-e204.

et al. 2014. Corrigendum: Whole-genome sequencing of bladder cancers reveals somatic CDKN1A mutations and clinicopathological associations with mutation burden. Nat Commun, 5 (1), pp. 4809.

et al. 2014. Corrigendum: Whole-genome sequencing of bladder cancers reveals somatic CDKN1A mutations and clinicopathological associations with mutation burden. Nat Commun, 5 (1), pp. 4264.

et al. 2014. Whole-genome sequencing of bladder cancers reveals somatic CDKN1A mutations and clinicopathological associations with mutation burden. Nat Commun, 5 (1), pp. 3756.

Bladder cancers are a leading cause of death from malignancy. Molecular markers might predict disease progression and behaviour more accurately than the available prognostic factors. Here we use whole-genome sequencing to identify somatic mutations and chromosomal changes in 14 bladder cancers of different grades and stages. As well as detecting the known bladder cancer driver mutations, we report the identification of recurrent protein-inactivating mutations in CDKN1A and FAT1. The former are not mutually exclusive with TP53 mutations or MDM2 amplification, showing that CDKN1A dysfunction is not simply an alternative mechanism for p53 pathway inactivation. We find strong positive associations between higher tumour stage/grade and greater clonal diversity, the number of somatic mutations and the burden of copy number changes. In principle, the identification of sub-clones with greater diversity and/or mutation burden within early-stage or low-grade tumours could identify lesions with a high risk of invasive progression.

et al. 2013. Survival in stage II/III colorectal cancer is independently predicted by chromosomal and microsatellite instability, but not by specific driver mutations. American Journal of Gastroenterology, 108 (11), pp. 1785-1793.

OBJECTIVES: Microsatellite instability (MSI) is an established marker of good prognosis in colorectal cancer (CRC). Chromosomal instability (CIN) is strongly negatively associated with MSI and has been shown to be a marker of poor prognosis in a small number of studies. However, a substantial group of "double-negative" (MSI-/CIN-) CRCs exists. The prognosis of these patients is unclear. Furthermore, MSI and CIN are each associated with specific molecular changes, such as mutations in KRAS and BRAF, that have been associated with prognosis. It is not known which of MSI, CIN, and the specific gene mutations are primary predictors of survival. METHODS: We evaluated the prognostic value (disease-free survival, DFS) of CIN, MSI, mutations in KRAS, NRAS, BRAF, PIK3CA, FBXW7, and TP53, and chromosome 18q loss-of-heterozygosity (LOH) in 822 patients from the VICTOR trial of stage II/III CRC. We followed up promising associations in an Australian community-based cohort (N=375). RESULTS: In the VICTOR patients, no specific mutation was associated with DFS, but individually MSI and CIN showed significant associations after adjusting for stage, age, gender, tumor location, and therapy. A combined analysis of the VICTOR and community-based cohorts showed that MSI and CIN were independent predictors of DFS (for MSI, hazard ratio (HR)=0.58, 95% confidence interval (CI) 0.36-0.93, and P=0.021; for CIN, HR=1.54, 95% CI 1.14-2.08, and P=0.005), and joint CIN/MSI testing significantly improved the prognostic prediction of MSI alone (P=0.028). Higher levels of CIN were monotonically associated with progressively poorer DFS, and a semi-quantitative measure of CIN was a better predictor of outcome than a simple CIN+/- variable. All measures of CIN predicted DFS better than the recently described Watanabe LOH ratio. CONCLUSIONS: MSI and CIN are independent predictors of DFS for stage II/III CRC. Prognostic molecular tests for CRC relapse should currently use MSI and a quantitative measure of CIN rather than specific gene mutations.

. 2013. OncoSNP-SEQ: a statistical approach for the identification of somatic copy number alterations from next-generation sequencing of cancer genomes. Bioinformatics, 29 (19), pp. 2482-2484.

SUMMARY: Recent major cancer genome sequencing studies have used whole-genome sequencing to detect various types of genomic variation. However, a number of these studies have continued to rely on SNP array information to provide additional results for copy number and loss-of-heterozygosity estimation and assessing tumour purity. OncoSNP-SEQ is a statistical model-based approach for inferring copy number profiles directly from high-coverage whole genome sequencing data that is able to account for unknown tumour purity and ploidy. AVAILABILITY: MATLAB code is available at the following URL: https://sites.google.com/site/oncosnpseq/.

. 2013. A decision-theoretic approach for segmental classification. Annals of Applied Statistics, 7 (3), pp. 1814-1835.

This paper is concerned with statistical methods for the segmental classification of linear sequence data where the task is to segment and classify the data according to an underlying hidden discrete state sequence. Such analysis is commonplace in the empirical sciences including genomics, finance and speech processing. In particular, we are interested in answering the following question: given data $y$ and a statistical model $\pi(x,y)$ of the hidden states $x$, what should we report as the prediction $\hat{x}$ under the posterior distribution $\pi (x|y)$? That is, how should you make a prediction of the underlying states? We demonstrate that traditional approaches such as reporting the most probable state sequence or most probable set of marginal predictions can give undesirable classification artefacts and offer limited control over the properties of the prediction. We propose a decision theoretic approach using a novel class of Markov loss functions and report $\hat{x}$ via the principle of minimum expected loss (maximum expected utility). We demonstrate that the sequence of minimum expected loss under the Markov loss function can be enumerated exactly using dynamic programming methods and that it offers flexibility and performance improvements over existing techniques. The result is generic and applicable to any probabilistic model on a sequence, such as Hidden Markov models, change point or product partition models.
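For reference, the principle of minimum expected loss named in the abstract can be written out as below. The symbols $L$ and $\ell_t$ are notation assumed here for illustration, and the pairwise decomposition on the right is one plausible reading of a "Markov loss function", not a form quoted from the paper.

```latex
\hat{x} \;=\; \arg\min_{x'} \; \mathbb{E}_{\pi(x \mid y)}\!\left[ L(x, x') \right]
       \;=\; \arg\min_{x'} \sum_{x} L(x, x')\, \pi(x \mid y),
\qquad
L(x, x') \;=\; \sum_{t=2}^{T} \ell_t\!\left(x_{t-1}, x_t,\; x'_{t-1}, x'_t\right).
```

Under a 0-1 loss on the whole sequence this rule returns the most probable state sequence, and under a sum of per-position 0-1 losses it returns the most probable marginal predictions, which are the two traditional reports the abstract contrasts with.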

. 2013. NucleoFinder: a statistical approach for the detection of nucleosome positions. Bioinformatics, 29 (6), pp. 711-716.

MOTIVATION: The identification of nucleosomes along the chromatin is key to understanding their role in the regulation of gene expression and other DNA-related processes. However, current experimental methods (MNase-ChIP, MNase-Seq) sample nucleosome positions from a cell population and contain biases, which makes the precise identification of individual nucleosomes difficult. Recent works have focused only on the first point, developing noise reduction approaches to identify nucleosome positions. RESULTS: In this article, we propose a new approach, termed NucleoFinder, that addresses both the positional heterogeneity across cells and experimental biases by seeking nucleosomes consistently positioned in a cell population and showing a significant enrichment relative to a control sample. Despite the absence of a validated dataset, we show that our approach (i) detects fewer false positives than two other nucleosome calling methods and (ii) identifies two important features of the nucleosome organization (the nucleosome spacing downstream of active promoters and the enrichment/depletion of GC/AT dinucleotides at the centre of in vitro nucleosomes) with equal or greater ability than the other two methods.

et al. 2013. Analysis of colorectal cancers in British Bangladeshi identifies early onset, frequent mucinous histotype and a high prevalence of RBFOX1 deletion. Mol Cancer, 12 (1), pp. 1.

BACKGROUND: Prevalence of colorectal cancer (CRC) in the British Bangladeshi population (BAN) is low compared to British Caucasians (CAU). Genetic background may influence mutations and disease features. METHODS: We characterized the clinicopathological features of BAN CRCs and interrogated their genomes using mutation profiling and high-density single nucleotide polymorphism (SNP) arrays and compared findings to CAU CRCs. RESULTS: Age of onset of BAN CRC was significantly lower than for CAU patients (p=3.0 × 10^-5) and this difference was not due to Lynch syndrome or the polyposis syndromes. KRAS mutations in BAN microsatellite stable (MSS) CRCs were rare (5.4%) compared to CAU MSS CRCs (25%; p=0.04), which correlates with the high percentage of mucinous histotype observed (31%) in the BAN samples. No BRAF mutations were seen in our BAN MSS CRCs (CAU CRCs, 12%; p=0.08). Array data revealed similar patterns of gains (chromosome 7 and 8q), losses (8p, 17p and 18q) and LOH (4q, 17p and 18q) in BAN and CAU CRCs. A small deletion on chromosome 16p13.2 involving only the alternative splicing factor RBFOX1 was found in significantly more BAN (50%) than CAU (15%) CRC cases (p=0.04). Focal deletions targeting the 5' end of the gene were also identified. Novel RBFOX1 mutations were found in CRC cell lines and tumours; mRNA and protein expression was reduced in tumours. CONCLUSIONS: KRAS mutations were rare in BAN MSS CRC and a mucinous histotype common. Loss of RBFOX1 may explain the anomalous splicing activity associated with CRC.

et al. 2012. Quantification of subclonal distributions of recurrent genomic aberrations in paired pre-treatment and relapse samples from patients with B-cell chronic lymphocytic leukemia. Leukemia, 26 (7), pp. 1564-1575.

Genome-wide array approaches and sequencing analyses are powerful tools for identifying genetic aberrations in cancers, including leukemias and lymphomas. However, the clinical and biological significance of such aberrations and their subclonal distribution are poorly understood. Here, we present the first genome-wide array based study of pre-treatment and relapse samples from patients with B-cell chronic lymphocytic leukemia (B-CLL) that uses the computational statistical tool OncoSNP. We show that quantification of the proportion of copy number alterations (CNAs) and copy neutral loss of heterozygosity regions (cnLOHs) in each sample is feasible. Furthermore, we (i) reveal complex changes in the subclonal architecture of paired samples at relapse compared with pre-treatment, (ii) provide evidence supporting an association between increased genomic complexity and poor clinical outcome, and (iii) report previously undefined, recurrent CNA/cnLOH regions that expand or newly occur at relapse and therefore might harbor candidate driver genes of relapse and/or chemotherapy resistance. Our findings are likely to impact on future therapeutic strategies aimed towards selecting effective and individually tailored targeted therapies.

. 2011. Hierarchical Bayesian nonparametric mixture models for clustering with variable relevance determination. Bayesian Anal, 6 (2), pp. 329-352.

We propose a hierarchical Bayesian nonparametric mixture model for clustering when some of the covariates are assumed to be of varying relevance to the clustering problem. This can be thought of as an issue in variable selection for unsupervised learning. We demonstrate that by defining a hierarchical population based nonparametric prior on the cluster locations scaled by the inverse covariance matrices of the likelihood we arrive at a 'sparsity prior' representation which admits a conditionally conjugate prior. This allows us to perform full Gibbs sampling to obtain posterior distributions over parameters of interest including an explicit measure of each covariate's relevance and a distribution over the number of potential clusters present in the data. This also allows for individual cluster specific variable selection. We demonstrate improved inference on a number of canonical problems.

. 2011. Bayesian Nonparametric Hidden Markov Models with application to the analysis of copy-number-variation in mammalian genomes. J R Stat Soc Series B Stat Methodol, 73 (1), pp. 37-57.

We consider the development of Bayesian Nonparametric methods for product partition models such as Hidden Markov Models and change point models. Our approach uses a Mixture of Dirichlet Process (MDP) model for the unknown sampling distribution (likelihood) for the observations arising in each state and a computationally efficient data augmentation scheme to aid inference. The method uses novel MCMC methodology which combines recent retrospective sampling methods with the use of slice sampler variables. The methodology is computationally efficient, both in terms of MCMC mixing properties, and robustness to the length of the time series being investigated. Moreover, the method is easy to implement requiring little or no user-interaction. We apply our methodology to the analysis of genomic copy number variation.

. 2010. Presynaptic NMDARs in the hippocampus facilitate transmitter release at theta frequency. Neuron, 68 (6), pp. 1109-1127.

A rise in [Ca(2+)](i) provides the trigger for neurotransmitter release at neuronal boutons. We have used confocal microscopy and Ca(2+) sensitive dyes to directly measure the action potential-evoked [Ca(2+)](i) in the boutons of Schaffer collaterals. This reveals that the trial-by-trial amplitude of the evoked Ca(2+) transient is bimodally distributed. We demonstrate that "large" Ca(2+) transients occur when presynaptic NMDA receptors are activated following transmitter release. Presynaptic NMDA receptor activation proves critical in producing facilitation of transmission at theta frequencies. Because large Ca(2+) transients "report" transmitter release, their frequency on a trial-by-trial basis can be used to estimate the probability of release, p(r). We use this novel estimator to show that p(r) increases following the induction of long-term potentiation.

. 2010. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. J Comput Graph Stat, 19 (4), pp. 769-789.

We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.

et al. 2010. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature, 464 (7289), pp. 713-720.

Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.

. 2010. A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biol, 11 (9), pp. R92.

We describe a statistical method for the characterization of genomic aberrations in single nucleotide polymorphism microarray data acquired from cancer genomes. Our approach allows us to model the joint effect of polyploidy, normal DNA contamination and intra-tumour heterogeneity within a single unified Bayesian framework. We demonstrate the efficacy of our method on numerous datasets including laboratory generated mixtures of normal-cancer cell lines and real primary tumours.

. 2009. Comparing CNV detection methods for SNP arrays. Brief Funct Genomic Proteomic, 8 (5), pp. 353-366.

Data from whole genome association studies can now be used for dual purposes, genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.

. 2008. GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics, 24 (19), pp. 2209-2214.

Current genotyping algorithms typically call genotypes by clustering allele-specific intensity data on a single nucleotide polymorphism (SNP) by SNP basis. This approach assumes the availability of a large number of control samples that have been sampled on the same array and platform. We have developed a SNP genotyping algorithm for the Illumina Infinium SNP genotyping assay that is entirely within-sample and requires neither a population of control samples nor parameters derived from such a population. Our algorithm exhibits high concordance with current methods and >99% call accuracy on HapMap samples. The ability to call genotypes using only within-sample information makes the method computationally light and practical for studies involving small sample sizes and provides a valuable independent quality control metric for other population-based approaches. AVAILABILITY: http://www.stats.ox.ac.uk/~giannoul/GenoSNP/.

. 2008. Akt regulates centrosome migration and spindle orientation in the early Drosophila melanogaster embryo. J Cell Biol, 180 (3), pp. 537-548.

Correct positioning and morphology of the mitotic spindle is achieved through regulating the interaction between microtubules (MTs) and cortical actin. Here we find that, in the Drosophila melanogaster early embryo, reduced levels of the protein kinase Akt result in incomplete centrosome migration around cortical nuclei, bent mitotic spindles, and loss of nuclei into the interior of the embryo. We show that Akt is enriched at the embryonic cortex and is required for phosphorylation of the glycogen synthase kinase-3beta homologue Zeste-white 3 kinase (Zw3) and for the cortical localizations of the adenomatosis polyposis coli (APC)-related protein APC2/E-APC and the MT + Tip protein EB1. We also show that reduced levels of Akt result in mislocalization of APC2 in postcellularized embryonic mitoses and misorientation of epithelial mitotic spindles. Together, our results suggest that Akt regulates a complex containing Zw3, Armadillo, APC2, and EB1 and that this complex has a role in stabilizing MT-cortex interactions, facilitating both centrosome separation and mitotic spindle orientation.

. 2008. CNV discovery using SNP genotyping arrays. Cytogenet Genome Res, 123 (1-4), pp. 307-312.

Genome-wide single nucleotide polymorphism (SNP) genotyping platforms have made an important contribution to population genetics and genetic epidemiology. Recently there has been a realisation that these SNP platforms can also be used for typing copy number variants (CNVs). This allows for 'generalised' genotyping of both SNPs and CNVs simultaneously on a common sample set, with advantages in terms of cost and unified analysis. In this article we review various statistical approaches to calling CNVs from SNP data. We highlight three tiers of algorithms depending on the level of information used.

. 2007. QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res, 35 (6), pp. 2013-2025.

Array-based technologies have been used to detect chromosomal copy number changes (aneuploidies) in the human genome. Recent studies identified numerous copy number variants (CNV) and some are common polymorphisms that may contribute to disease susceptibility. We developed, and experimentally validated, a novel computational framework (QuantiSNP) for detecting regions of copy number variation from BeadArray SNP genotyping data using an Objective Bayes Hidden-Markov Model (OB-HMM). Objective Bayes measures are used to set certain hyperparameters in the priors using a novel re-sampling framework to calibrate the model to a fixed Type I (false positive) error rate. Other parameters are set via maximum marginal likelihood to prior training data of known structure. QuantiSNP provides probabilistic quantification of state classifications and significantly improves the accuracy of segmental aneuploidy identification and mapping, relative to existing analytical tools (Beadstudio, Illumina), as demonstrated by validation of breakpoint boundaries. QuantiSNP identified both novel and validated CNVs. QuantiSNP was developed using BeadArray SNP data but it can be adapted to other platforms and we believe that the OB-HMM framework has widespread applicability in genomic research. In conclusion, QuantiSNP is a novel algorithm for high-resolution CNV/aneuploidy detection with application to clinical genetics, cancer and disease association studies.
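As a generic illustration of how a hidden Markov model can segment a signal-intensity track into copy-number states, the sketch below runs standard Viterbi decoding with Gaussian emissions. It is not QuantiSNP and does not reproduce its Objective Bayes calibration: the three states, their mean log R ratios, the shared standard deviation and the self-transition probability are made-up values chosen only for this example.

```python
import math

STATES = ["loss", "normal", "gain"]
MEANS = {"loss": -0.6, "normal": 0.0, "gain": 0.4}  # hypothetical log R ratio means
SD = 0.2     # hypothetical shared emission standard deviation
STAY = 0.98  # hypothetical probability of remaining in the same state


def log_emission(value, state):
    """Log density of a Gaussian emission for the given state."""
    z = (value - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2.0 * math.pi))


def log_transition(prev, curr):
    """Log transition probability: stay with STAY, otherwise split evenly."""
    if prev == curr:
        return math.log(STAY)
    return math.log((1.0 - STAY) / (len(STATES) - 1))


def viterbi(values):
    """Most probable state sequence for a list of log R ratio values."""
    if not values:
        return []
    # Uniform initial distribution over states.
    score = {s: -math.log(len(STATES)) + log_emission(values[0], s) for s in STATES}
    back = []
    for v in values[1:]:
        new_score, pointers = {}, {}
        for curr in STATES:
            # Best predecessor for the current state at this position.
            prev = max(STATES, key=lambda p: score[p] + log_transition(p, curr))
            pointers[curr] = prev
            new_score[curr] = score[prev] + log_transition(prev, curr) + log_emission(v, curr)
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


track = [0.02, -0.05, 0.01, -0.58, -0.63, -0.55, -0.61, 0.03, 0.00]
print(viterbi(track))
```

The high self-transition probability is what discourages isolated single-probe calls. Viterbi returns only the single best path, whereas the abstract describes probabilistic quantification of state classifications, which would require posterior state probabilities (for example from the forward-backward algorithm) instead.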

. The Alive Particle Filter

In the following article we develop a particle filter for approximating Feynman-Kac models with indicator potentials. Examples of such models include approximate Bayesian computation (ABC) posteriors associated with hidden Markov models (HMMs) or rare-event problems. Such models require the use of advanced particle filter or Markov chain Monte Carlo (MCMC) algorithms, e.g. Jasra et al. (2012), to perform estimation. One of the drawbacks of existing particle filters is that they may 'collapse', in that the algorithm may terminate early, due to the indicator potentials. In this article, using a special case of the locally adaptive particle filter in Lee et al. (2013), which is closely related to Le Gland & Oudjane (2004), we use an algorithm which can deal with this latter problem, whilst introducing a random cost per time step. This algorithm is investigated from a theoretical perspective and several results are given which help to validate the algorithm and to provide guidelines for its implementation. In addition, we show how this algorithm can be used within MCMC, using particle MCMC (Andrieu et al. 2010). Numerical examples are presented for ABC approximations of HMMs.
