
Zamin Iqbal

 

I am a Postdoctoral Researcher in the McVean group at the Wellcome Trust Centre for Human Genetics, in Oxford.

I work on algorithms, mathematical models and software for the analysis of genetic variation using high-throughput sequencing data. I co-wrote Cortex, a low-memory multi-sample de Bruijn graph genome assembler; my main focus is the use of de novo assembly to analyse genetic variation of many kinds.

I am particularly interested in understanding the complex and highly diverse gene families associated with immune systems, such as the Major Histocompatibility Complex in humans. I also have an interest in the application of population sequencing data to the study of the evolution and epidemiology of pathogens, especially the malaria parasite, Plasmodium falciparum.

I am part of the 1000 Genomes Consortium, and have used polymorphism-aware de novo assembly extensively to analyse human individual and population data.

Research:

Very Big Picture:

The determination of genome sequence and the distinction of genomic differences between species, and between individuals within a species, are of fundamental importance in biology, laying the foundation for resolving functional mechanisms, for finding correlation between genotype and phenotype, and for building and using models of evolution and selection. New high-throughput sequencing technologies are revolutionising our ability to analyse genomes, but as our ambitions grow, so do the computational and statistical challenges.

Variant discovery up to now - limitations:

The standard approach to discovering genetic variants within a species requires mapping sequence reads to a reference genome. This rests on the assumption that all individuals are almost identical to the reference, and it causes problems where that assumption is false, or where the two haplotypes in a diploid organism differ significantly. Even with a high-quality reference such as that for humans (which has had thousands of man-hours put into it and still has teams of people improving it), mapping algorithms are known to exhibit reference bias. In fact we have shown (see our Cortex/Nature Genetics paper referenced below) that a substantial amount of human genetic variation is invisible to reference-based approaches: partly because some variants are so complex that mappers cannot place reads correctly, and partly because the reference is missing the sequence the reads "should" map to. Furthermore, there are medically important regions of the genome (for example the HLA and KIR genes, which are a particular interest of mine) that have undergone extremely complicated evolutionary histories of recombination and selection, that are massively polymorphic, where mapping approaches are unreliable, and where no single-individual reference can be complete.

A new approach:

In collaboration with Mario Caccamo and Gil McVean, I developed the Cortex assembler, the first genome assembler capable of handling multiple eukaryote genomes at the same time, and also the first to engage with polymorphism. Variants can be discovered by assembling multiple samples simultaneously and comparing their genomes directly, without any attempt to build a full consensus genome assembly. Not only does this circumvent problems caused by errors in the reference, but it is also more successful at calling variants that are larger and more complex than SNPs and small indels (see the Nature Genetics paper mentioned below for details, including a comparison of Cortex calls with 1000 Genomes calls from the pilot paper). To give a rough idea of scale, Cortex can assemble 10 humans on a server with 256 GB of RAM, about 1000 S. cerevisiae samples in 64 GB of RAM, or over 2000 S. aureus samples in 32 GB of RAM.
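As a rough illustration of the idea of comparing samples directly in one graph, here is a minimal Python sketch of a "coloured" de Bruijn graph. It is a toy under stated assumptions, not Cortex's implementation: the function names are made up for this example, sequences are treated as error-free and single-stranded (no reverse complements), and a "variant" is reported simply as a node whose outgoing branches are carried by different sets of samples.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer of seq, in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample names (colours) containing it.

    samples: dict of sample name -> list of sequences (hypothetical input format).
    """
    graph = defaultdict(set)
    for name, seqs in samples.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                graph[kmer].add(name)
    return graph

def successors(graph, kmer):
    """K-mers present in the graph that overlap kmer by its last k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in graph]

def divergent_branches(graph):
    """Find nodes with several successors that carry different colour sets."""
    found = {}
    for kmer in list(graph):
        nxt = successors(graph, kmer)
        if len(nxt) > 1 and len({frozenset(graph[n]) for n in nxt}) > 1:
            found[kmer] = {n: sorted(graph[n]) for n in nxt}
    return found

# Two toy samples that differ by a single base (A vs C at position 7).
samples = {"s1": ["GATTACAGGCT"], "s2": ["GATTACCGGCT"]}
graph = build_colored_graph(samples, k=4)
for node, branches in divergent_branches(graph).items():
    print(node, branches)
```

In this toy case the two samples share the path up to `TTAC` and then split into one branch per sample before rejoining at `GGCT`; no reference or consensus sequence is involved at any point.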

We have also developed a mathematical model extending the Lander-Waterman model, which predicts the power to detect a variant of a given length as a function of experimental parameters (read length and coverage), informatic parameters (the de Bruijn graph k-mer size) and biological parameters (the repeat content of the specific genome under study). The predictions of this model match the results of simulations and of empirical data (human, chimpanzee). This enables researchers to tailor the design of their experiment to match their goals. For details, and examples of how different aims would lead to different experimental designs, see the Supplementary Note to our Nature Genetics paper.
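To give a feel for how read length, coverage and k-mer size interact, the sketch below computes two textbook Lander-Waterman-style quantities in Python. It is not the extended model from the Supplementary Note: it assumes error-free reads whose start positions are uniformly (Poisson) distributed, it ignores repeat content entirely, and the function names are invented for this example.

```python
import math

def kmer_coverage(base_coverage, read_length, k):
    """Expected number of reads that contain a given k-mer.

    A read of length L can contain a given k-mer from L - k + 1 start
    positions, so per-base coverage c becomes c * (L - k + 1) / L.
    """
    if k > read_length:
        return 0.0
    return base_coverage * (read_length - k + 1) / read_length

def prob_kmer_observed(base_coverage, read_length, k):
    """Probability that at least one read contains the k-mer (Poisson model)."""
    return 1.0 - math.exp(-kmer_coverage(base_coverage, read_length, k))

# Example: 100 bp reads at 30x coverage, for a few k-mer sizes.
for k in (21, 31, 55, 95):
    print(k, kmer_coverage(30, 100, k), prob_kmer_observed(30, 100, k))
```

Even this simplified calculation shows the trade-off a designer faces: a larger k resolves more repeats but lowers the effective k-mer coverage for a fixed read length and sequencing depth.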

Finally, and in some ways most interestingly, we can exploit this ability to assemble multiple samples simultaneously to distinguish between polymorphism, repeats and sequencing errors by looking at their segregation statistics in the population. Repeats, paralogues and polymorphisms typically cause problems for assemblers, and are usually attacked with purely technical (computational or library prep) methods. What we have shown is that by changing the original problem (doing assembly on many samples, not just one), we can bring in a completely new source of information (population genetics) to improve our results. Not only does this show the way to new methods for improving single-individual assembly, but it also opens up the possibility of studying any species (not just one with a reference) using high-throughput sequencing data.
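The intuition behind using segregation across samples can be sketched with a deliberately crude Python heuristic. This is not the statistical model used in Cortex; the function name, the input format and the `min_cov` threshold are all assumptions made for illustration. Given the coverage of the two branches of a graph "bubble" in each sample, it asks only which samples credibly carry which branch.

```python
def classify_site(branch_counts, min_cov=2):
    """Crudely label a two-branch site from per-sample branch coverage.

    branch_counts: dict of sample name -> (coverage on branch A, coverage on branch B).
    min_cov: hypothetical minimum coverage for a branch to count as present.
    """
    has_a = [a >= min_cov for a, _ in branch_counts.values()]
    has_b = [b >= min_cov for _, b in branch_counts.values()]
    if not any(has_a) or not any(has_b):
        # One branch never reaches credible coverage in any sample.
        return "error-like"
    if all(x and y for x, y in zip(has_a, has_b)):
        # Every sample carries both branches, as expected for a repeat or paralogue.
        return "repeat-like"
    # Branches are present in some samples and absent in others.
    return "polymorphism-like"

print(classify_site({"s1": (20, 0), "s2": (0, 18), "s3": (10, 9)}))
print(classify_site({"s1": (20, 1), "s2": (19, 0), "s3": (22, 0)}))
print(classify_site({"s1": (20, 11), "s2": (18, 9), "s3": (21, 10)}))
```

A real method would need a proper likelihood model; with few samples, for example, a site where everyone happens to be heterozygous would be mislabelled by this heuristic as repeat-like.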

Email: zam@well.ox.ac.uk

Selected Publications:

Senn H, Ogden R, Cezard T, Gharbi K, Iqbal Z, Johnson E, Kamps-Hughes N, Rosell F, McEwing R.
Reference-free SNP discovery for the Eurasian beaver from restriction site-associated DNA paired-end data.
Mol Ecol. (2013) Jun;22(11):3141-50. doi: 10.1111/mec.12242

Z Iqbal, I Turner, G McVean
High-throughput microbial population genomics using the Cortex variation assembler.
Bioinformatics. 2012; doi: 10.1093/bioinformatics/bts673

The 1000 Genomes Consortium
An integrated map of genetic variation from 1,092 human genomes.
Nature. 2012; http://dx.doi.org/10.1038/nature11632

Adam Auton, Adi Fledel-Alon, Susanne Pfeifer, Oliver Venn, Laure Ségurel, Teresa Street, Ellen M. Leffler, Rory Bowden, Ivy Aneas, John Broxholme, Peter Humburg, Zamin Iqbal, Gerton Lunter, Julian Maller, Ryan D. Hernandez, Cord Melton, Aarti Venkat, Marcelo A. Nobrega, Ronald Bontrop, Simon Myers, Peter Donnelly, Molly Przeworski, Gil McVean.
A Fine-Scale Chimpanzee Genetic Map from Population Sequencing
Science. 2012 Mar 15

Bernadette C. Young, Tanya Golubchik, Elizabeth M. Batty, Rowena Fung, Hanna Larner-Svensson, Antonina A. Votintseva, Ruth R. Miller, Heather Godwin, Kyle Knox, Richard G. Everitt, Zamin Iqbal, Andrew J. Rimmer, Madeleine Cule, Camilla L. C. Ip, Xavier Didelot, Rosalind M. Harding, Peter Donnelly, Tim E. Peto, Derrick W. Crook, Rory Bowden, and Daniel J. Wilson.
Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease.
Proc. Natl. Acad. Sci. (2012) doi: 10.1073/pnas.1113219109

Z Iqbal*, M Caccamo*, I Turner, P Flicek, G McVean.
De novo assembly and genotyping of variants using colored de Bruijn graphs.
Nat Genet. 2012 Jan 8;44(2):226-32. doi: 10.1038/ng.1028.

RE Mills, K Walter, C Stewart, RE Handsaker, K Chen, C Alkan, A Abyzov, SC Yoon, K Ye, RK Cheetham, A Chinwalla, DF Conrad, Y Fu, F Grubert, I Hajirasouliha, F Hormozdiari, LM Iakoucheva, Z Iqbal, S Kang, JM Kidd, MK Konkel, J Korn, E Khurana, D Kural, HY Lam, J Leng, R Li, Y Li, CY Lin, R Luo, XJ Mu, J Nemesh, HE Peckham, T Rausch, A Scally, X Shi, MP Stromberg, AM Stütz, AE Urban, JA Walker, J Wu, Y Zhang, ZD Zhang, MA Batzer, L Ding, GT Marth, G McVean, J Sebat, M Snyder, J Wang, K Ye, EE Eichler, MB Gerstein, ME Hurles, C Lee, SA McCarroll, JO Korbel, 1000 Genomes Project.
Mapping copy number variation by population-scale genome sequencing.
Nature 470:59-65

The 1000 Genomes Project Consortium
A map of human genome variation from population-scale sequencing
Nature, October 2010

 
