Wellcome Trust Centre for Human Genetics

SNP Selection across whole chromosomes

big_haplotype

For the analysis of very large data sets (for instance, whole chromosomes), a modified approach is required. The basic problem is this: if the number of people typed (and hence the number of chromosomes) is of the same order as the number of markers typed, and if some of the markers are so widely spaced that they are not in linkage disequilibrium (LD), then it is possible to choose a subset of markers that explains all the observed haplotypic diversity but is useless, in the sense that its selection is contingent upon the accidental long-range haplotypic structure present in the data set.

big_haplotype works slightly differently from span_haplotype and greedy_haplotype. It is used to process very long regions, such as whole chromosomes, by moving a window along the genome and performing a greedy SNP search within each window. Each SNP is given a score, and the best SNPs are then selected by score. The result is a global set of SNPs that approximates the local haplotypic structure uniformly along the region. A by-product is the entropy of each window, which gives a local measure of linkage disequilibrium (low entropy implies high LD).
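To make the link between window entropy and LD concrete, here is a minimal Python sketch. It assumes that the entropy of a window is the Shannon entropy (in bits) of the haplotype patterns observed within that window; the exact definition used by big_haplotype is given in the method description, so treat this as an illustration only. The function name `window_entropy` and the string encoding of haplotypes are hypothetical choices made for this example.

```python
from collections import Counter
from math import log2

def window_entropy(haplotypes, start, width):
    """Shannon entropy (bits) of the haplotype patterns seen in
    SNP columns start .. start+width-1 (assumed definition)."""
    patterns = Counter(h[start:start + width] for h in haplotypes)
    total = sum(patterns.values())
    return -sum((c / total) * log2(c / total) for c in patterns.values())

# Two SNPs in complete LD: only two distinct haplotypes are observed.
high_ld = ["00", "00", "11", "11"]
# Two SNPs in equilibrium: all four haplotypes are equally common.
low_ld = ["00", "01", "10", "11"]

print(window_entropy(high_ld, 0, 2))  # 1.0 bit: low entropy, high LD
print(window_entropy(low_ld, 0, 2))   # 2.0 bits: high entropy, low LD
```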

In detail, the following steps are performed. Suppose we have N SNPs and a window size w, and we wish to identify the top pN SNPs, where p is a value such as 0.5.

  1. For each SNP window [i,i+w-1], perform a greedy SNP analysis to find the near-optimal order of SNP selection in that window. Suppose that for SNP n (n=i..i+w-1), the increase in entropy that occurs when it is selected is Dn,i. We define Dn,i = 0 if SNP n is not in the window. We also compute E(i+w/2), the total entropy for the window [i,i+w-1].
  2. The score Sn for SNP n is defined to be the total D value for all windows in which it occurs: Sn = Σi Dn,i, summed over all window start positions i.
  3. The top pN SNPs, sorted by score, are selected.
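The three steps above can be sketched in Python as follows. This is not the distributed C program, only an illustration under stated assumptions: entropy is taken to be the Shannon entropy of the observed haplotype patterns, the greedy step picks the SNP that gives the largest entropy at each stage, only full windows are scored, and ties are broken by SNP position. All function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(haplotypes, snps):
    """Shannon entropy (bits) of the haplotypes restricted to the given SNP columns."""
    counts = Counter(tuple(h[s] for s in snps) for h in haplotypes)
    total = len(haplotypes)
    return -sum(c / total * log2(c / total) for c in counts.values())

def greedy_gains(haplotypes, window):
    """Step 1: greedy ordering within one window.
    Returns ({snp: entropy increase when selected}, total window entropy)."""
    chosen, gains, current = [], {}, 0.0
    remaining = list(window)
    while remaining:
        # Assumed greedy criterion: take the SNP that maximises the entropy so far.
        best = max(remaining, key=lambda s: entropy(haplotypes, chosen + [s]))
        new = entropy(haplotypes, chosen + [best])
        gains[best] = new - current          # this is D(n,i)
        chosen.append(best)
        remaining.remove(best)
        current = new
    return gains, current

def big_haplotype_scores(haplotypes, w, p):
    n_snps = len(haplotypes[0])
    scores = [0.0] * n_snps                  # S(n), step 2
    window_entropy = {}                      # E(i + w/2)
    for i in range(n_snps - w + 1):          # assumption: full windows only
        gains, total = greedy_gains(haplotypes, range(i, i + w))
        for n, d in gains.items():
            scores[n] += d
        window_entropy[i + w // 2] = total
    # Step 3: keep the top p*N SNPs by score.
    ranked = sorted(range(n_snps), key=lambda n: scores[n], reverse=True)
    return sorted(ranked[:int(p * n_snps)]), scores, window_entropy

# Toy data: SNPs 0 and 1 are identical, as are SNPs 2 and 3.
haps = ["0000", "0011", "1100", "1111"]
selected, scores, ent = big_haplotype_scores(haps, w=2, p=0.5)
print(selected)  # [0, 2]
print(scores)    # [1.0, 1.0, 2.0, 0.0]
```

In the toy data, one SNP from each redundant pair is kept, which is the behaviour the scoring is meant to produce. The quadratic greedy loop is fine for small windows such as w = 15 but makes no attempt at efficiency.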

In the example below we use a set of 899 SNPs typed across 80 haplotypes for Human Chromosome 19. [Data kindly provided by Lon Cardon - see Phillips MS, et al (2003) Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat Genet. 33(3):382-7.]

The plot shows the effect of selecting the best 50% of SNPs, using a window size of 15. The total entropy (blue) in each window is a measure of the LD in that region (low entropy means high LD); the partial entropy (green) is the entropy achieved by typing only the selected markers; the fractional entropy (purple) is the partial entropy divided by the total entropy; and the number of SNPs used in each window is shown in red.
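The four plotted quantities can be computed per window along the following lines. This Python sketch assumes that the partial entropy is the Shannon entropy of the haplotypes restricted to the selected SNPs that fall inside the window, and that a window with zero total entropy is reported with fraction 1.0; both are assumptions for illustration, and `window_summary` is a hypothetical name rather than part of big_haplotype.

```python
from collections import Counter
from math import log2

def entropy(haplotypes, snps):
    """Shannon entropy (bits) of the haplotypes restricted to the given SNP columns."""
    counts = Counter(tuple(h[s] for s in snps) for h in haplotypes)
    total = len(haplotypes)
    return -sum(c / total * log2(c / total) for c in counts.values())

def window_summary(haplotypes, selected, w):
    """For each full window return
    (window centre, total entropy, partial entropy, fractional entropy, SNPs used)."""
    chosen = set(selected)
    rows = []
    for i in range(len(haplotypes[0]) - w + 1):
        window = range(i, i + w)
        used = [s for s in window if s in chosen]
        total = entropy(haplotypes, window)      # all SNPs in the window
        partial = entropy(haplotypes, used)      # selected SNPs only
        # Assumption: a window with no diversity counts as fully explained.
        fraction = partial / total if total > 0 else 1.0
        rows.append((i + w // 2, total, partial, fraction, len(used)))
    return rows

# Toy data: SNPs 0 and 1 are identical, as are SNPs 2 and 3; SNPs 0 and 2 are selected.
haps = ["0000", "0011", "1100", "1111"]
for row in window_summary(haps, selected=[0, 2], w=2):
    print(row)
```

In this toy case the middle window spans two unlinked SNPs, only one of which is selected, so its fractional entropy is 0.5 while the outer windows are fully explained.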

The output format for big_haplotype is described here

Drop me an email for more details.

 
 