Si Quang LE

Senior Statistician (Feb 2011 - present): MalariaGen, Wellcome Trust Centre for Human Genetics, University of Oxford.

Postdoctoral Fellowship (May 2008 - Jan 2011): 1000 Genomes Project, Wellcome Trust Sanger Institute and Microsoft Research Center, Cambridge, UK

One main aim of the 1000 Genomes Project is to sequence 1000 samples at low coverage and then combine data across samples to detect shared variants. We developed methods to discover and genotype single-nucleotide polymorphism (SNP) sites from low-coverage sequencing data, making use of shared haplotype (linkage disequilibrium) information represented by ancestral recombination graphs (ARGs). For each population, we first collect SNP candidates and then refine the posterior probability of each candidate by considering possible mutations on the internal branches of the 40 marginal ancestral trees inferred from the 20 ARGs at the left and right flanking genotype sites. The methods are also applicable to indel genotyping. SNP and indel calls from our methods are integrated with calls from the Broad Institute and the University of Michigan to form the official release of the 1000 Genomes Project. Software implementing the methods is available in the QCALL package from www.sanger.ac.uk/software/QCALL/.

Postdoctoral Fellowship (Oct 2005 - May 2008), Probabilistic and mixture models in phylogenetics, Montpellier Laboratory of Informatics, Robotics, and Microelectronics, Montpellier, France.

Amino acid replacement models are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches, and thus the likelihood of the data, as well as in protein alignment. Substitution processes vary depending on the structural configuration of the protein residues as well as on the species. We developed a method to estimate amino acid replacement models from large datasets, based on a likelihood approach with Gamma rate variation.
We applied the method to Pfam protein data to generate the general LG model and to influenza-specific data to estimate the FLU model. We also introduced structure-based models (EX, EHO, and EX_EHO) estimated from HSSP databases, the unsupervised models UL2 and UL3, and the empirical profile models C10-C60. We showed through empirical experiments that all of these models outperform traditional models such as WAG and JTT. We developed and maintain a pipeline to estimate amino acid models from protein alignments, as well as several software packages that accompany these models.

PhD dissertation (Oct 2002 - Sept 2005), Similarity for complex data, Japan Advanced Institute of Science and Technology, Japan

My dissertation focused on similarity measures for heterogeneous data, both structured and unstructured. I developed mathematical models for different data types and integrated these models using Fisher’s transformation. The method was applied to categorical data, mixed heterogeneous data (numerical, categorical, ordered, etc.) and graph-structured data of 2D chemical compounds. In nearest-neighbor classification, the measures outperformed the other measures they were compared with.

Hepatitis Association study (Oct 2002 - Sept 2005), Japan Advanced Institute of Science and Technology, Japan

The hepatitis temporal database collected at Chiba University Hospital, Japan, included patients corresponding to 983 tests represented as sequences of irregularly timestamped points with different lengths. We presented a temporal abstraction approach to find associations between temporal abstractions and patient status as well as treatment efficiency. The associations were not only confirmed by medical doctors but also bring new understanding of hepatitis and its treatment.