Yau Group Publications
Publication details can also be found via Google Scholar
Modeling bifurcations in single-cell transcriptomics data has become an increasingly popular field of research. Several methods have been proposed to infer bifurcation structure from such data, but all rely on heuristic non-probabilistic inference. Here we propose the first generative, fully probabilistic model for such inference based on a Bayesian hierarchical mixture of factor analyzers. Our model exhibits competitive performance on large datasets despite implementing full Markov-Chain Monte Carlo sampling, and its unique hierarchical prior structure enables automatic determination of genes driving the bifurcation process. We additionally propose an Empirical-Bayes-like extension that deals with the high levels of zero-inflation in single-cell RNA-seq data and quantify when such models are useful. We apply our model to both real and simulated single-cell gene expression data and compare the results to existing pseudotime methods. Finally, we discuss both the merits and weaknesses of such a unified, probabilistic approach in the context of practical bioinformatics analyses.
Up to 10% of cases of gastric cancer are familial, but so far, only mutations in CDH1 have been associated with gastric cancer risk. To identify genetic variants that affect risk for gastric cancer, we collected blood samples from 28 patients with hereditary diffuse gastric cancer (HDGC) not associated with mutations in CDH1 and performed whole-exome sequence analysis. We then analyzed sequences of candidate genes in 333 independent HDGC and non-HDGC cases. We identified 11 cases with mutations in PALB2, BRCA1, or RAD51C genes, which regulate homologous DNA recombination. We found these mutations in 2 of 31 patients with HDGC (6.5%) and 9 of 331 patients with sporadic gastric cancer (2.8%). Most of these mutations had been previously associated with other types of tumors and partially co-segregated with gastric cancer in our study. Tumors that developed in patients with these mutations had a mutation signature associated with somatic homologous recombination deficiency. Our findings indicate that defects in homologous recombination increase risk for gastric cancer.
Motivation: Pseudotime analyses of single-cell RNA-seq data have become increasingly common. Typically, a latent trajectory corresponding to a biological process of interest, such as differentiation or cell cycle, is discovered. However, relatively little attention has been paid to modelling the differential expression of genes along such trajectories. Results: We present switchde, a statistical framework and accompanying R package for identifying switch-like differential expression of genes along pseudotemporal trajectories. Our method includes fast model fitting that provides interpretable parameter estimates corresponding to how quickly a gene is up or down regulated as well as where in the trajectory such regulation occurs. It also reports a P-value in favour of rejecting a constant-expression model for switch-like differential expression and optionally models the zero-inflation prevalent in single-cell data. Availability and Implementation: The R package switchde is available through the Bioconductor project at https://bioconductor.org/packages/switchde. Contact: email@example.com. Supplementary information: Supplementary data are available at Bioinformatics online.
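As an illustration of what "how quickly" and "where in the trajectory" can mean for a switch-like gene, the following is a minimal Python sketch that assumes a sigmoidal mean curve. The function name `switch_mean`, the parameter names `mu0`, `k` and `t0`, and the exact functional form are assumptions made for this example; they are not taken from the abstract and do not use the R package's API.

```python
import math

def switch_mean(t, mu0, k, t0):
    """Illustrative sigmoidal mean expression at pseudotime t (assumed form).

    mu0: expression level at the switch point t0 (the curve plateaus at 2 * mu0)
    k:   switch strength; the sign gives the direction, the magnitude the speed
    t0:  pseudotime at which the switch occurs
    """
    return 2.0 * mu0 / (1.0 + math.exp(-k * (t - t0)))

pseudotimes = [0.0, 0.25, 0.5, 0.75, 1.0]

# k > 0 gives an up-regulated gene, k < 0 a down-regulated one.
up = [switch_mean(t, mu0=1.0, k=20.0, t0=0.5) for t in pseudotimes]
down = [switch_mean(t, mu0=1.0, k=-20.0, t0=0.5) for t in pseudotimes]

for t, u, d in zip(pseudotimes, up, down):
    print(f"t={t:.2f}  up={u:.3f}  down={d:.3f}")
```

Under this assumed form, a constant-expression model corresponds to `k = 0`, which is the kind of null model a switch-like fit would be tested against.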
Single cell gene expression profiling can be used to quantify transcriptional dynamics in temporal processes, such as cell differentiation, using computational methods to label each cell with a 'pseudotime' where true time series experimentation is too difficult to perform. However, owing to the high variability in gene expression between individual cells, there is an inherent uncertainty in the precise temporal ordering of the cells. Pre-existing methods for pseudotime estimation have predominantly given point estimates precluding a rigorous analysis of the implications of uncertainty. We use probabilistic modelling techniques to quantify pseudotime uncertainty and propagate this into downstream differential expression analysis. We demonstrate that reliance on a point estimate of pseudotime can lead to inflated false discovery rates and that probabilistic approaches provide greater robustness and measures of the temporal resolution that can be obtained from pseudotime inference.
The adipocyte-rich microenvironment forms a niche for ovarian cancer metastasis, but the mechanisms driving this process are incompletely understood. Here we show that salt-inducible kinase 2 (SIK2) is overexpressed in adipocyte-rich metastatic deposits compared with ovarian primary lesions. Overexpression of SIK2 in ovarian cancer cells promotes abdominal metastasis while SIK2 depletion prevents metastasis in vivo. Importantly, adipocytes induce calcium-dependent activation and autophosphorylation of SIK2. Activated SIK2 plays a dual role in augmenting AMPK-induced phosphorylation of acetyl-CoA carboxylase and in activating the PI3K/AKT pathway through p85α-S154 phosphorylation. These findings identify SIK2 at the apex of the adipocyte-induced signaling cascades in cancer cells and make a compelling case for targeting SIK2 for therapy in ovarian cancer.
Current screening methods for ovarian cancer can only detect advanced disease. Earlier detection has proved difficult because the molecular precursors involved in the natural history of the disease are unknown. To identify early driver mutations in ovarian cancer cells, we used dense whole genome sequencing of micrometastases and microscopic residual disease collected at three time points over three years from a single patient during treatment for high-grade serous ovarian cancer (HGSOC). The functional and clinical significance of the identified mutations was examined using a combination of population-based whole genome sequencing, targeted deep sequencing, multi-center analysis of protein expression, loss of function experiments in an in-vivo reporter assay and mammalian models, and gain of function experiments in primary cultured fallopian tube epithelial (FTE) cells. We identified frequent mutations involving a 40kb distal repressor region for the key stem cell differentiation gene SOX2. In the apparently normal FTE, the region was also mutated. This was associated with a profound increase in SOX2 expression (p<2(-16)), which was not found in patients without cancer (n=108). Importantly, we show that SOX2 overexpression in FTE is nearly ubiquitous in patients with HGSOCs (n=100), and common in BRCA1-BRCA2 mutation carriers (n=71) who underwent prophylactic salpingo-oophorectomy. We propose that the finding of SOX2 overexpression in FTE could be exploited to develop biomarkers for detecting disease at a premalignant stage, which would reduce mortality from this devastating disease.
© 2016 The Author(s). Published with license by Taylor & Francis. Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward–backward algorithm. In this article, we expand the amount of information we could obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online.
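For readers unfamiliar with the baseline mentioned above, the following is a minimal Python sketch of the standard Viterbi recursion for the most probable hidden state sequence. It is not the k-segment algorithm described in the abstract. The two-state model, the state names and all probabilities are hypothetical values chosen only for illustration.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden state path for the observations."""
    # best[s]: log-probability of the best path that ends in state s
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[i][s]: best predecessor of state s at step i + 1
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            r = max(states, key=lambda q: prev[q] + log_trans[q][s])
            best[s] = prev[r] + log_trans[r][s] + log_emit[s][o]
            ptr[s] = r
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical two-state model, for illustration only.
states = ["normal", "gain"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {a: {b: math.log(0.9 if a == b else 0.1) for b in states}
             for a in states}
log_emit = {
    "normal": {"lo": math.log(0.8), "hi": math.log(0.2)},
    "gain": {"lo": math.log(0.2), "hi": math.log(0.8)},
}

path = viterbi(["lo", "lo", "hi", "hi", "hi"],
               states, log_start, log_trans, log_emit)
print(path)
```

The recursion works in log space to avoid numerical underflow on long sequences, and its cost is linear in the sequence length.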
BACKGROUND: Advances in single cell genomics provide a way of routinely generating transcriptomics data at the single cell level. A frequent requirement of single cell expression analysis is the identification of novel patterns of heterogeneity across single cells that might explain complex cellular states or tissue composition. To date, classical statistical analysis tools have been routinely applied, but there is considerable scope for the development of novel statistical approaches that are better adapted to the challenges of inferring cellular hierarchies. RESULTS: We have developed a novel agglomerative clustering method that we call pcaReduce to generate a cell state hierarchy where each cluster branch is associated with a principal component of variation that can be used to differentiate two cell states. Using two real single cell datasets, we compared our approach to other commonly used statistical techniques, such as K-means and hierarchical clustering. We found that pcaReduce was able to give more consistent clustering structures when compared to broad and detailed cell type labels. CONCLUSIONS: Our novel integration of principal components analysis and hierarchical clustering establishes a connection between the representation of the expression data and the number of cell types that can be discovered. In doing so we found that pcaReduce performs better than either technique in isolation in terms of characterising putative cell states. Our methodology is complementary to other single cell clustering techniques and adds to a growing palette of single cell bioinformatics tools for profiling heterogeneous cell populations.
Single-cell RNA-seq data allows insight into normal cellular function and various disease states through molecular characterization of gene expression on the single cell level. Dimensionality reduction of such high-dimensional data sets is essential for visualization and analysis, but single-cell RNA-seq data are challenging for classical dimensionality-reduction methods because of the prevalence of dropout events, which lead to zero-inflated data. Here, we develop a dimensionality-reduction method, (Z)ero (I)nflated (F)actor (A)nalysis (ZIFA), which explicitly models the dropout characteristics, and show that it improves modeling accuracy on simulated and biological data sets.
To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.
BLOOD, 124 (21), 2014. The Identification of Further Minimal Regions of Overlap in Chronic Lymphocytic Leukemia Using High-Resolution SNP Arrays
Journal of Computational and Graphical Statistics, 23 (4), pp. 1143-1162. 2014. A Sequential Algorithm for Fast Fitting of Dirichlet Process Mixture Models
Haematologica, 99 (10), pp. e201-e204. Citations: 4 (Scopus). 2014. Erythrocytosis associated with a novel missense mutation in the BPGM gene.
Bladder cancers are a leading cause of death from malignancy. Molecular markers might predict disease progression and behaviour more accurately than the available prognostic factors. Here we use whole-genome sequencing to identify somatic mutations and chromosomal changes in 14 bladder cancers of different grades and stages. As well as detecting the known bladder cancer driver mutations, we report the identification of recurrent protein-inactivating mutations in CDKN1A and FAT1. The former are not mutually exclusive with TP53 mutations or MDM2 amplification, showing that CDKN1A dysfunction is not simply an alternative mechanism for p53 pathway inactivation. We find strong positive associations between higher tumour stage/grade and greater clonal diversity, the number of somatic mutations and the burden of copy number changes. In principle, the identification of sub-clones with greater diversity and/or mutation burden within early-stage or low-grade tumours could identify lesions with a high risk of invasive progression.
OBJECTIVES: Microsatellite instability (MSI) is an established marker of good prognosis in colorectal cancer (CRC). Chromosomal instability (CIN) is strongly negatively associated with MSI and has been shown to be a marker of poor prognosis in a small number of studies. However, a substantial group of double-negative (MSI-/CIN-) CRCs exists. The prognosis of these patients is unclear. Furthermore, MSI and CIN are each associated with specific molecular changes, such as mutations in KRAS and BRAF, that have been associated with prognosis. It is not known which of MSI, CIN, and the specific gene mutations are primary predictors of survival. METHODS: We evaluated the prognostic value (disease-free survival, DFS) of CIN, MSI, mutations in KRAS, NRAS, BRAF, PIK3CA, FBXW7, and TP53, and chromosome 18q loss-of-heterozygosity (LOH) in 822 patients from the VICTOR trial of stage II/III CRC. We followed up promising associations in an Australian community-based cohort (N=375). RESULTS: In the VICTOR patients, no specific mutation was associated with DFS, but individually MSI and CIN showed significant associations after adjusting for stage, age, gender, tumor location, and therapy. A combined analysis of the VICTOR and community-based cohorts showed that MSI and CIN were independent predictors of DFS (for MSI, hazard ratio (HR)=0.58, 95% confidence interval (CI) 0.36-0.93, and P=0.021; for CIN, HR=1.54, 95% CI 1.14-2.08, and P=0.005), and joint CIN/MSI testing significantly improved the prognostic prediction of MSI alone (P=0.028). Higher levels of CIN were monotonically associated with progressively poorer DFS, and a semi-quantitative measure of CIN was a better predictor of outcome than a simple CIN+/- variable. All measures of CIN predicted DFS better than the recently described Watanabe LOH ratio. CONCLUSIONS: MSI and CIN are independent predictors of DFS for stage II/III CRC.
Prognostic molecular tests for CRC relapse should currently use MSI and a quantitative measure of CIN rather than specific gene mutations.
Bioinformatics, 29 (19), pp. 2482-2484. 2013. OncoSNP-SEQ: a statistical approach for the identification of somatic copy number alterations from next-generation sequencing of cancer genomes
This paper is concerned with statistical methods for the segmental classification of linear sequence data where the task is to segment and classify the data according to an underlying hidden discrete state sequence. Such analysis is commonplace in the empirical sciences including genomics, finance and speech processing. In particular, we are interested in answering the following question: given data y and a statistical model π(x, y) of the hidden states x, what should we report as the prediction x̂ under the posterior distribution π(x|y)? That is, how should you make a prediction of the underlying states? We demonstrate that traditional approaches such as reporting the most probable state sequence or most probable set of marginal predictions can give undesirable classification artefacts and offer limited control over the properties of the prediction. We propose a decision theoretic approach using a novel class of Markov loss functions and report x̂ via the principle of minimum expected loss (maximum expected utility). We demonstrate that the sequence of minimum expected loss under the Markov loss function can be enumerated exactly using dynamic programming methods and that it offers flexibility and performance improvements over existing techniques. The result is generic and applicable to any probabilistic model on a sequence, such as Hidden Markov models, change point or product partition models. © Institute of Mathematical Statistics, 2013.
MOTIVATION: The identification of nucleosomes along the chromatin is key to understanding their role in the regulation of gene expression and other DNA-related processes. However, current experimental methods (MNase-ChIP, MNase-Seq) sample nucleosome positions from a cell population and contain biases, which makes the precise identification of individual nucleosomes difficult. Recent works have only focused on the first point, where noise reduction approaches have been developed to identify nucleosome positions. RESULTS: In this article, we propose a new approach, termed NucleoFinder, that addresses both the positional heterogeneity across cells and experimental biases by seeking nucleosomes consistently positioned in a cell population and showing a significant enrichment relative to a control sample. Despite the absence of a validated dataset, we show that our approach (i) detects fewer false positives than two other nucleosome calling methods and (ii) identifies two important features of the nucleosome organization (the nucleosome spacing downstream of active promoters and the enrichment/depletion of GC/AT dinucleotides at the centre of in vitro nucleosomes) with equal or greater ability than the other two methods.
BACKGROUND: Prevalence of colorectal cancer (CRC) in the British Bangladeshi population (BAN) is low compared to British Caucasians (CAU). Genetic background may influence mutations and disease features. METHODS: We characterized the clinicopathological features of BAN CRCs and interrogated their genomes using mutation profiling and high-density single nucleotide polymorphism (SNP) arrays and compared findings to CAU CRCs. RESULTS: Age of onset of BAN CRC was significantly lower than for CAU patients (p=3.0 × 10^-5) and this difference was not due to Lynch syndrome or the polyposis syndromes. KRAS mutations in BAN microsatellite stable (MSS) CRCs were rare (5.4%) compared to CAU MSS CRCs (25%; p=0.04), which correlates with the high percentage of mucinous histotype observed (31%) in the BAN samples. No BRAF mutations were seen in our BAN MSS CRCs (CAU CRCs, 12%; p=0.08). Array data revealed similar patterns of gains (chromosome 7 and 8q), losses (8p, 17p and 18q) and LOH (4q, 17p and 18q) in BAN and CAU CRCs. A small deletion on chromosome 16p13.2 involving only the alternative splicing factor RBFOX1 was found in significantly more BAN (50%) than CAU (15%) CRC cases (p=0.04). Focal deletions targeting the 5' end of the gene were also identified. Novel RBFOX1 mutations were found in CRC cell lines and tumours; mRNA and protein expression was reduced in tumours. CONCLUSIONS: KRAS mutations were rare in BAN MSS CRC and a mucinous histotype common. Loss of RBFOX1 may explain the anomalous splicing activity associated with CRC.
Genome-wide array approaches and sequencing analyses are powerful tools for identifying genetic aberrations in cancers, including leukemias and lymphomas. However, the clinical and biological significance of such aberrations and their subclonal distribution are poorly understood. Here, we present the first genome-wide array based study of pre-treatment and relapse samples from patients with B-cell chronic lymphocytic leukemia (B-CLL) that uses the computational statistical tool OncoSNP. We show that quantification of the proportion of copy number alterations (CNAs) and copy neutral loss of heterozygosity regions (cnLOHs) in each sample is feasible. Furthermore, we (i) reveal complex changes in the subclonal architecture of paired samples at relapse compared with pre-treatment, (ii) provide evidence supporting an association between increased genomic complexity and poor clinical outcome, and (iii) report previously undefined, recurrent CNA/cnLOH regions that expand or newly occur at relapse and therefore might harbor candidate driver genes of relapse and/or chemotherapy resistance. Our findings are likely to impact on future therapeutic strategies aimed towards selecting effective and individually tailored targeted therapies.
We propose a hierarchical Bayesian nonparametric mixture model for clustering when some of the covariates are assumed to be of varying relevance to the clustering problem. This can be thought of as an issue in variable selection for unsupervised learning. We demonstrate that by defining a hierarchical population based nonparametric prior on the cluster locations scaled by the inverse covariance matrices of the likelihood we arrive at a 'sparsity prior' representation which admits a conditionally conjugate prior. This allows us to perform full Gibbs sampling to obtain posterior distributions over parameters of interest including an explicit measure of each covariate's relevance and a distribution over the number of potential clusters present in the data. This also allows for individual cluster specific variable selection. We demonstrate improved inference on a number of canonical problems.
We consider the development of Bayesian Nonparametric methods for product partition models such as Hidden Markov Models and change point models. Our approach uses a Mixture of Dirichlet Process (MDP) model for the unknown sampling distribution (likelihood) for the observations arising in each state and a computationally efficient data augmentation scheme to aid inference. The method uses novel MCMC methodology which combines recent retrospective sampling methods with the use of slice sampler variables. The methodology is computationally efficient, both in terms of MCMC mixing properties, and robustness to the length of the time series being investigated. Moreover, the method is easy to implement requiring little or no user-interaction. We apply our methodology to the analysis of genomic copy number variation.
A rise in [Ca(2+)](i) provides the trigger for neurotransmitter release at neuronal boutons. We have used confocal microscopy and Ca(2+) sensitive dyes to directly measure the action potential-evoked [Ca(2+)](i) in the boutons of Schaffer collaterals. This reveals that the trial-by-trial amplitude of the evoked Ca(2+) transient is bimodally distributed. We demonstrate that "large" Ca(2+) transients occur when presynaptic NMDA receptors are activated following transmitter release. Presynaptic NMDA receptor activation proves critical in producing facilitation of transmission at theta frequencies. Because large Ca(2+) transients "report" transmitter release, their frequency on a trial-by-trial basis can be used to estimate the probability of release, p(r). We use this novel estimator to show that p(r) increases following the induction of long-term potentiation.
We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.
Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.
We describe a statistical method for the characterization of genomic aberrations in single nucleotide polymorphism microarray data acquired from cancer genomes. Our approach allows us to model the joint effect of polyploidy, normal DNA contamination and intra-tumour heterogeneity within a single unified Bayesian framework. We demonstrate the efficacy of our method on numerous datasets including laboratory generated mixtures of normal-cancer cell lines and real primary tumours.
Data from whole genome association studies can now be used for dual purposes, genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.
Current genotyping algorithms typically call genotypes by clustering allele-specific intensity data on a single nucleotide polymorphism (SNP) by SNP basis. This approach assumes the availability of a large number of control samples that have been sampled on the same array and platform. We have developed a SNP genotyping algorithm for the Illumina Infinium SNP genotyping assay that is entirely within-sample and requires neither a population of control samples nor parameters derived from such a population. Our algorithm exhibits high concordance with current methods and >99% call accuracy on HapMap samples. The ability to call genotypes using only within-sample information makes the method computationally light and practical for studies involving small sample sizes and provides a valuable independent quality control metric for other population-based approaches. AVAILABILITY: http://www.stats.ox.ac.uk/~giannoul/GenoSNP/.
Correct positioning and morphology of the mitotic spindle is achieved through regulating the interaction between microtubules (MTs) and cortical actin. Here we find that, in the Drosophila melanogaster early embryo, reduced levels of the protein kinase Akt result in incomplete centrosome migration around cortical nuclei, bent mitotic spindles, and loss of nuclei into the interior of the embryo. We show that Akt is enriched at the embryonic cortex and is required for phosphorylation of the glycogen synthase kinase-3beta homologue Zeste-white 3 kinase (Zw3) and for the cortical localizations of the adenomatosis polyposis coli (APC)-related protein APC2/E-APC and the MT + Tip protein EB1. We also show that reduced levels of Akt result in mislocalization of APC2 in postcellularized embryonic mitoses and misorientation of epithelial mitotic spindles. Together, our results suggest that Akt regulates a complex containing Zw3, Armadillo, APC2, and EB1 and that this complex has a role in stabilizing MT-cortex interactions, facilitating both centrosome separation and mitotic spindle orientation.
Genome-wide single nucleotide polymorphism (SNP) genotyping platforms have made an important contribution to population genetics and genetic epidemiology. Recently there has been a realisation that these SNP platforms can also be used for typing copy number variants (CNVs). This allows for 'generalised' genotyping of both SNPs and CNVs simultaneously on a common sample set, with advantages in terms of cost and unified analysis. In this article we review various statistical approaches to calling CNVs from SNP data. We highlight three tiers of algorithms depending on the level of information used.
Array-based technologies have been used to detect chromosomal copy number changes (aneuploidies) in the human genome. Recent studies identified numerous copy number variants (CNV) and some are common polymorphisms that may contribute to disease susceptibility. We developed, and experimentally validated, a novel computational framework (QuantiSNP) for detecting regions of copy number variation from BeadArray SNP genotyping data using an Objective Bayes Hidden-Markov Model (OB-HMM). Objective Bayes measures are used to set certain hyperparameters in the priors using a novel re-sampling framework to calibrate the model to a fixed Type I (false positive) error rate. Other parameters are set via maximum marginal likelihood to prior training data of known structure. QuantiSNP provides probabilistic quantification of state classifications and significantly improves the accuracy of segmental aneuploidy identification and mapping, relative to existing analytical tools (Beadstudio, Illumina), as demonstrated by validation of breakpoint boundaries. QuantiSNP identified both novel and validated CNVs. QuantiSNP was developed using BeadArray SNP data but it can be adapted to other platforms and we believe that the OB-HMM framework has widespread applicability in genomic research. In conclusion, QuantiSNP is a novel algorithm for high-resolution CNV/aneuploidy detection with application to clinical genetics, cancer and disease association studies.
Total publications on this page: 32
Total citations for publications on this page: 1728