QCTOOL v2

Assessing linkage disequilibrium

Computing LD metrics

The -compute-ld-with option can be used to compute LD metrics between the genotype data and a seperate set of genotype contained in a second file. E.g.:

$ qctool -g example.bgen -s example.sample -compute-ld-with second.bgen second.sample -old sqlite://ld.sqlite:LD

For each biallelic variant in example.bgen, and each biallelic variant in second.bgen, QCTOOL constructs the pairwise table of non-missing genotypes, uses an EM algorithm to resolve phase of the double heterozygotes, and then outputs the frequency of each haplotype and the D' and r² statistics. (Multiallelic variants are currently not handled.)

Note: to compute LD, samples are matched between datasets using the primary ID (the first column in the sample files) by default. To alter this, use the -match-sample-ids option. This is described further on the page on merging datasets.

Note: Currently LD output is always to a sqlite database file. In the above command, results are placed in the file ld.sqlite and in a table called LD; a convenience view, called LDView is also constructed. The simplest way to view that data is to use the sqlite3 command-line client (which is already installed on most linux systems by default). E.g. the command:

$ sqlite3 -csv -header ld.sqlite "SELECT * FROM LDView LIMIT 10"

Will list the first ten LD records in CSV format. The file can equally be accessed programmatically, for example using the sqlite3 module in python, or the RSQLite package in R. We may add flat file support in future.

Controlling what is output

Pairwise LD computation can generate a massive amount of data. To reduce this the -min-r2 and -max-ld-distance options can be used, e.g.

$ qctool -g example.bgen -s example.sample -compute-ld-with second.bgen second.sample -old sqlite://ld.sqlite:LD -min-r2 0.05 -max-ld-distance 1Mb

will output results only for variants estimated to have r² > 0.05, and within a megabase of each other. For the latter option, you can also specify distances in base pairs (e.g. -max-ld-distance 1000), in kilobases (e.g. -max-ld-distance 1kb), or in megabases as in the example above.

Adjusting the degree of shrinkage

QCTOOL outputs shrinkage estimates of LD by default. Specifically, haplotype frequencies are estimated under a Dirichlet(x,x,x,x) prior, with x=1.25 by default; roughly speaking this is equivalent to adding a quarter of an observation of each of the four haplotypes to the data. Intuitively this corresponds to weak prior assumptions that 1: both variants are polymorphic and 2: there is at least some recombination between them.

The -prior-ld-weight option can be used to adjust the strength of this prior, e.g. the command

$ qctool -g example.bgen -s example.sample -compute-ld-with second.bgen second.sample -old sqlite://ld.sqlite:LD -ld-prior-weight <w>

estimates LD under a Dirichlet(1+w/4,1+w/4,1+w/4,1+w/4) prior; setting w=0 removes the prior altogether.

Stratifying LD computations across subsets

As for -snp-stats, the -stratify option tells QCTOOL to compute LD statistics stratified over subsets of the data. E.g. suppose the sample file contains a column called POP which reflects the population group of each sample. Then the command:

$ qctool -g example.bgen -s example.sample -compute-ld-with second.bgen second.sample -old sqlite://ld.sqlite:LD -stratify POP

will compute LD statistics for all pairs of variants in each population. Output columns will be named in the form POP=<level>:<variable>.

For example, suppose we've extracted data for the O blood group mutation rs8176719 from the 1000 Genomes Project data into a file named rs8176719.vcf. This command will find all tagging SNPs in the flanking region:

$ qctool -g ALL.chr9.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz -s 1000G.sample -compute-ld-with rs8176719.vcf 1000G.sample -old sqlite://ld.sqlite:LD -stratify POP -min-r2 0.75 -max-distance 1Mb

Note that currently it's first necessary to make a sample file in the format described here for this to work.

Note: This will also take a while to run, since at present it scans the whole of chromosome 9. A quicker way would be to use bgenix together with QCTOOL in in a pipeline.