SNP Selection across whole chromosomes

big_haplotype

For the analysis of very large data sets (for instance whole chromosomes), a modified approach is required. The basic problem arises when the number of people (and hence the number of chromosomes) typed is of the same order as the number of markers typed, and some of the markers are so widely spaced that they are not in linkage disequilibrium (LD). It is then possible to choose a subset of markers that explains all the observed haplotypic diversity but is useless, in the sense that the selection is contingent upon the accidental long-range haplotypic structure present in the dataset.

big_haplotype works slightly differently from span_haplotype and greedy_haplotype. It is used to process very long regions, such as whole chromosomes, by moving a window along the genome and performing a greedy SNP search within each window. Each SNP is given a score, and the best SNPs are then selected by their score. The result is a global set of SNPs that uniformly approximates the local haplotypic structure. A by-product is the calculation of the entropy for each window, which gives a local measure of linkage disequilibrium (low entropy implies high LD).

In detail, the following steps are performed: Suppose we have N SNPs and window size w, and we wish to identify the top pN SNPs, where p is a value like 0.5.
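The windowed greedy search described above can be illustrated with a minimal Python sketch. It is not the big_haplotype implementation: the function names (entropy, greedy_window, select_snps) are hypothetical, and the scoring rule (a SNP's score is the number of windows in which the greedy search picked it) is an assumption made only so that the example is complete. Haplotypes are assumed to be equal-length strings of allele codes.

```python
from collections import Counter
from math import log2

def entropy(haplotypes, columns):
    """Shannon entropy (bits) of the haplotypes restricted to the given SNP columns."""
    counts = Counter(tuple(h[c] for c in columns) for h in haplotypes)
    n = len(haplotypes)
    return sum((k / n) * log2(n / k) for k in counts.values())

def greedy_window(haplotypes, window):
    """Greedily add the SNP with the largest entropy gain until the
    chosen SNPs capture the total entropy of the window."""
    total = entropy(haplotypes, window)
    chosen, current = [], 0.0
    while current < total - 1e-12 and len(chosen) < len(window):
        best = max((c for c in window if c not in chosen),
                   key=lambda c: entropy(haplotypes, chosen + [c]))
        chosen.append(best)
        current = entropy(haplotypes, chosen)
    return chosen, total

def select_snps(haplotypes, w, p):
    """Slide a window of w SNPs along the region, score each SNP, keep the top p*N."""
    n_snps = len(haplotypes[0])
    scores = [0] * n_snps
    for start in range(n_snps - w + 1):
        chosen, _ = greedy_window(haplotypes, list(range(start, start + w)))
        for c in chosen:
            scores[c] += 1  # assumed scoring rule: count of windows that picked this SNP
    keep = int(p * n_snps)
    ranked = sorted(range(n_snps), key=lambda c: (-scores[c], c))
    return sorted(ranked[:keep]), scores

# Toy data: SNPs 0 and 1 are in perfect LD, SNP 2 is independent, SNP 3 is monomorphic.
haps = ["0000", "0010", "1100", "1110"]
selected, scores = select_snps(haps, w=4, p=0.5)
```

In this toy example only one of the two perfectly correlated SNPs is needed, and the monomorphic SNP is never chosen because it adds no entropy.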
In the example below we use a set of 899 SNPs typed across 80 haplotypes for human chromosome 19. [Data kindly provided by Lon Cardon; see Phillips MS, et al. (2003) Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat Genet. 33(3):382-7.] The plot shows the effect of selecting the best 50% of SNPs, using a window size of 15. The total entropy (blue) in each window is a measure of the LD in that region (low entropy means high LD); the partial entropy (green) is the entropy achieved by typing only the selected markers; the fractional entropy (purple) is the partial entropy divided by the total entropy; the number of SNPs used in each window is shown in red.
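As a rough illustration of the four plotted quantities, the following Python sketch computes them for each window of a toy dataset. The function names are hypothetical, and two details are assumptions rather than documented behaviour of big_haplotype: the partial entropy is taken to be the entropy of the haplotypes restricted to the selected SNPs that fall inside the window, and the fractional entropy is set to 1.0 when the total entropy of a window is zero.

```python
from collections import Counter
from math import log2

def entropy(haplotypes, columns):
    """Shannon entropy (bits) of the haplotypes restricted to the given SNP columns."""
    counts = Counter(tuple(h[c] for c in columns) for h in haplotypes)
    n = len(haplotypes)
    return sum((k / n) * log2(n / k) for k in counts.values())

def window_profile(haplotypes, selected, w):
    """Return (total, partial, fractional, n_used) for every window of w SNPs."""
    chosen = set(selected)
    n_snps = len(haplotypes[0])
    rows = []
    for start in range(n_snps - w + 1):
        window = list(range(start, start + w))
        used = [c for c in window if c in chosen]
        total = entropy(haplotypes, window)    # all SNPs in the window
        partial = entropy(haplotypes, used)    # selected SNPs only
        # Assumption: a window with no diversity counts as fully explained.
        fractional = partial / total if total > 0 else 1.0
        rows.append((total, partial, fractional, len(used)))
    return rows

# Toy data: SNPs 0 and 1 are in perfect LD, SNP 2 is independent, SNP 3 is monomorphic.
haps = ["0000", "0010", "1100", "1110"]
profile = window_profile(haps, selected=[0, 2], w=2)
```

A fractional entropy below 1 in a window, as in the middle window of this toy example, indicates that the selected SNPs do not capture all of the local haplotypic diversity.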
The output format for big_haplotype is described here.