Computing principal components
Use the -kinship option to compute a relatedness matrix, the
-UDUT option to eigendecompose it, and the
-PCs option to output PCs. A complete example would look like this:
This outputs the first 20 PCs to the file
PCs.csv, in addition to the estimated kinship
matrix and its eigendecomposition. The following sections show the use of these options in more detail.
The -kinship option can be used to estimate a kinship matrix, as in:
This outputs pairwise kinship values to the file
kinship.csv, which is stored in a 'long' format
with columns holding the first sample id, second sample id, the number of pairwise non-missing genotypes,
and the estimated kinship value. (Only the upper triangle of this matrix is output.)
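As a rough illustration of the long format described above, the following Python sketch rebuilds a symmetric matrix from such rows. It assumes the rows have already been parsed into (first sample id, second sample id, non-missing count, kinship) tuples. The function name long_to_square and the example values are made up for illustration and are not QCTOOL output.

```python
import numpy as np

def long_to_square(rows):
    """Rebuild a symmetric matrix from (id1, id2, n_nonmissing, kinship) rows.

    Hypothetical helper: only the upper triangle needs to be present in
    `rows`; the lower triangle is filled in by symmetry.
    """
    # Collect sample ids in order of first appearance.
    ids = []
    for id1, id2, _count, _value in rows:
        for sample_id in (id1, id2):
            if sample_id not in ids:
                ids.append(sample_id)
    index = {sample_id: k for k, sample_id in enumerate(ids)}

    # Pairs that never appear in the input stay NaN.
    K = np.full((len(ids), len(ids)), np.nan)
    for id1, id2, _count, value in rows:
        i, j = index[id1], index[id2]
        K[i, j] = K[j, i] = value
    return ids, K

# Made-up example rows for three samples (upper triangle only).
rows = [
    ("s1", "s1", 100, 1.02),
    ("s1", "s2", 98, 0.05),
    ("s1", "s3", 100, -0.01),
    ("s2", "s2", 99, 0.97),
    ("s2", "s3", 97, 0.25),
    ("s3", "s3", 100, 1.01),
]
ids, K = long_to_square(rows)
```

The non-missing count is ignored here; it could be kept in a second matrix if pairs with few shared genotypes need to be filtered.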
More precisely, suppose X is the L×N matrix of genotypes, with variants indexed by row. Let f_i be an estimate of the frequency of the ith variant. We write Z for the matrix X after centring and rescaling each row based on the allele frequency,
Z_i· = (X_i· - mean(X_i·)) / √(2 f_i (1 - f_i))
QCTOOL estimates the kinship matrix as (1/L) Z^t Z. In forming Z, QCTOOL uses a posterior estimate of the allele frequency f_i under a Beta(2,2) prior, i.e. f_i = (1 + N_b) / (2 + 2N), where N is the number of samples and N_b is the number of 'b' alleles in the data. This can be understood as implicitly adding a single haplotype of each allelic type to the data before computing the frequency, which in turn ensures that the frequency estimate is never exactly 0 or 1.
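The estimator above can be sketched in a few lines of NumPy. This is an illustration of the formulas only, not QCTOOL's implementation: it assumes X holds 'b' allele counts (0, 1 or 2) with no missing genotypes, and the function name estimate_kinship is hypothetical.

```python
import numpy as np

def estimate_kinship(X):
    """Sketch of K = (1/L) Z^t Z for an L x N matrix of 'b' allele counts.

    Assumes diploid genotypes coded 0/1/2 and no missing data.
    """
    X = np.asarray(X, dtype=float)
    L, N = X.shape

    # Frequency estimate with one extra haplotype of each allelic type:
    # f_i = (1 + N_b) / (2 + 2N), so f_i is never exactly 0 or 1.
    Nb = X.sum(axis=1)
    f = (1.0 + Nb) / (2.0 + 2.0 * N)

    # Centre each row and rescale by sqrt(2 f_i (1 - f_i)).
    centred = X - X.mean(axis=1, keepdims=True)
    Z = centred / np.sqrt(2.0 * f * (1.0 - f))[:, None]

    return Z.T @ Z / L, f

# Simulated genotypes: 500 variants, 6 samples.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 6))
X[0, :] = 0  # a monomorphic variant; its frequency estimate stays above zero
K, f = estimate_kinship(X)
```

Because the monomorphic variant still receives a nonzero frequency estimate, its row of Z is all zeros rather than undefined, so it contributes nothing to the kinship values.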
The -UDUT option can be used to compute a UDU^t decomposition (i.e. an eigendecomposition) of the computed kinship matrix. E.g.
The argument to -PCs is the number of PCs to output.
Note: the PCs computed are simply rescaled entries of the right eigenvectors; they are computed as PC_i = √(1/L) × U_·i D^(-1/2). This scaling ensures the PCs do not grow with the number of variants.
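A minimal NumPy sketch of the eigendecomposition and the rescaling described in the note follows. It reads D^(-1/2) as dividing the ith eigenvector by the square root of the ith eigenvalue, which is an interpretation of the formula rather than something confirmed here; the function name pcs_from_kinship is hypothetical, and the requested eigenvalues are assumed to be strictly positive.

```python
import numpy as np

def pcs_from_kinship(K, L, n_pcs):
    """Sketch: eigendecompose a symmetric kinship matrix and rescale.

    Returns an N x n_pcs array whose column i is
    sqrt(1/L) * U[:, i] / sqrt(d_i), plus the eigenvalues used.
    Assumes the top n_pcs eigenvalues are strictly positive.
    """
    # eigh is for symmetric matrices and returns ascending eigenvalues.
    d, U = np.linalg.eigh(K)

    # Keep the n_pcs largest eigenvalues and their eigenvectors.
    order = np.argsort(d)[::-1][:n_pcs]
    d, U = d[order], U[:, order]

    pcs = np.sqrt(1.0 / L) * U / np.sqrt(d)
    return pcs, d

# Simulated centred data: 200 variants, 5 samples.
rng = np.random.default_rng(1)
L = 200
Z = rng.normal(size=(L, 5))
Z -= Z.mean(axis=1, keepdims=True)
K = Z.T @ Z / L

pcs, d = pcs_from_kinship(K, L, n_pcs=2)
```

Since each column is only a rescaled eigenvector, multiplying K by a column returns the same column scaled by its eigenvalue.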
Note: PCs are output to the file specified by
Depending on the command line, other values might also be output to this file. For example,
if you specify both
-PCs, the output file will contain both
per-sample summary statistics and PCs.
See the page
on summary statistic file formats for more information on the format of the output.
In some contexts it may be preferable to load a previously
computed kinship matrix, rather than to recompute a new one. This can be achieved with the
The -PCs option can again be used to adjust how many loadings are computed. Note: you should ensure that the same set of variants is used to compute loadings as was used in constructing the kinship matrix.