Computing principal components
Use the -kinship option to compute a relatedness matrix, the -UDUT option to eigendecompose it, and the -PCs option to output PCs. A complete example would look like this:
This outputs the first 20 PCs to the file PCs.csv, in addition to the estimated kinship matrix and its eigendecomposition. The following sections show the use of these options in more detail.
The -kinship option can be used to estimate a kinship matrix, as in:
This outputs pairwise kinship values to the file kinship.csv, which is stored in a 'long' format with columns holding the first sample id, the second sample id, the number of pairwise non-missing genotypes, and the estimated kinship value. (Only the upper triangle of this matrix is output.)
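To make the 'long' layout concrete, the following sketch rebuilds a full symmetric matrix from upper-triangle rows. It assumes the rows have already been parsed into tuples in the column order described above, and that the diagonal is included in the output; the sample ids, counts and kinship values are made up for illustration, and the helper name `long_to_square` is hypothetical rather than part of QCTOOL.

```python
import numpy as np

def long_to_square(rows):
    """Rebuild a symmetric kinship matrix from upper-triangle 'long' rows.

    Each row is (first_id, second_id, n_nonmissing, kinship). This tuple
    layout is an assumption that mirrors the column order described above.
    """
    # Collect sample ids in order of first appearance.
    ids = []
    for first, second, _, _ in rows:
        for sample in (first, second):
            if sample not in ids:
                ids.append(sample)
    index = {sample: k for k, sample in enumerate(ids)}

    # Pairs that never appear in the input stay as NaN.
    K = np.full((len(ids), len(ids)), np.nan)
    for first, second, _, value in rows:
        i, j = index[first], index[second]
        K[i, j] = value
        K[j, i] = value  # mirror into the lower triangle
    return ids, K

# Made-up rows for three samples (illustrative values only).
rows = [
    ("s1", "s1", 100, 1.02),
    ("s1", "s2", 98, 0.05),
    ("s1", "s3", 100, -0.01),
    ("s2", "s2", 99, 0.97),
    ("s2", "s3", 99, 0.24),
    ("s3", "s3", 100, 1.00),
]
ids, K = long_to_square(rows)
```

The third column (the number of non-missing genotypes) is ignored here; it could be used to filter out pairs whose estimates rest on too few variants.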
More precisely, suppose X is the L×N matrix of genotypes, with variants indexed by row. Let f_i be an estimate of the frequency of the ith variant. We write Z for the matrix X after centring and rescaling each row based on the allele frequency,
Z_i· = (X_i· - mean(X_i·)) / √(2 f_i (1 - f_i))
QCTOOL estimates the kinship matrix as (1/L) Z^t Z. In forming Z, QCTOOL uses a posterior estimate of allele frequency f_i under a Beta(2,2) distribution, i.e. f_i = (1 + N_b) / (2 + 2N), where N_b is the number of 'b' alleles in the data. This can be understood as implicitly adding a single haplotype of each allelic type to the data before computing the frequency, which in turn ensures that the frequency estimate is never exactly 0 or 1.
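The calculation above can be sketched in a few lines of NumPy. This is an illustration of the formulas only, not QCTOOL's implementation: it assumes diploid genotypes coded as 0/1/2 counts of the 'b' allele and no missing data, and the function names and toy matrix are hypothetical.

```python
import numpy as np

def estimate_frequency(X):
    """f_i = (1 + N_b) / (2 + 2N), where N_b is the 'b' allele count in row i."""
    n_samples = X.shape[1]
    b_counts = X.sum(axis=1)
    return (1.0 + b_counts) / (2.0 + 2.0 * n_samples)

def kinship(X):
    """Return (1/L) Z^t Z for an L x N genotype matrix with no missing data."""
    n_variants = X.shape[0]
    f = estimate_frequency(X)
    # Centre each row on its mean, then rescale by sqrt(2 f (1 - f)).
    centred = X - X.mean(axis=1, keepdims=True)
    Z = centred / np.sqrt(2.0 * f * (1.0 - f))[:, np.newaxis]
    return Z.T @ Z / n_variants

# Toy data: 4 variants (rows) by 3 samples (columns).
X = np.array([
    [0, 1, 2],
    [0, 0, 0],  # monomorphic variant: f stays above zero, so no division by zero
    [2, 2, 1],
    [1, 0, 1],
], dtype=float)

f = estimate_frequency(X)
K = kinship(X)
```

The monomorphic second row shows why the added haplotypes matter: a plain sample frequency would be 0 there, and the rescaling step would divide by zero.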
The -UDUT option can be used to compute a UDUT decomposition (i.e. an eigendecomposition) of the computed kinship matrix, e.g.:
To output PCs, use the -PCs option:
The argument is the number of PCs to output.
Note: the PCs computed are simply rescaled entries of the right eigenvectors; they are computed as PC_i = √(1/L) × U_·i D^(-1/2). This scaling ensures the PCs do not grow with the number of variants.
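As a rough illustration of the decomposition and the rescaling in the note above, the sketch below eigendecomposes a small symmetric matrix with NumPy and applies the formula literally. It assumes that D^(-1/2) is applied per component, using the eigenvalue belonging to each eigenvector; the function name, the toy matrix and the value of L are hypothetical, and this is not QCTOOL's own code.

```python
import numpy as np

def principal_components(K, n_variants, n_pcs):
    """Return U, D (eigenvalues, descending) and the first n_pcs rescaled PCs.

    Assumes the leading n_pcs eigenvalues are strictly positive.
    """
    # eigh is for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(K)
    order = np.argsort(eigenvalues)[::-1]
    D = eigenvalues[order]
    U = eigenvectors[:, order]
    # PC_i = sqrt(1/L) * U_.i * D_i^(-1/2), reading the note literally.
    pcs = np.sqrt(1.0 / n_variants) * U[:, :n_pcs] / np.sqrt(D[:n_pcs])
    return U, D, pcs

# Toy symmetric positive-definite matrix standing in for a kinship matrix.
K = np.array([
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])
U, D, pcs = principal_components(K, n_variants=1000, n_pcs=2)
```

A kinship matrix built from row-centred genotypes has at least one eigenvalue that is zero up to rounding, so in this sketch the number of PCs requested should stay below the number of samples.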
Note: PCs are output to the file specified by -osample. Depending on the command line, other values might also be output to this file. For example, if you specify both -sample-stats and -PCs, the output file will contain both per-sample summary statistics and PCs. See the page on summary statistic file formats for more information on the format of the output.
In some contexts it may be preferable to load a previously computed kinship matrix rather than recompute one. This can be achieved with the -load-kinship option:
Loadings can be computed using the -loadings option, e.g.:
The -PCs option can again be used to adjust how many loadings are computed.
Note: you should ensure that the same set of variants is used to compute loadings as was used in constructing the kinship matrix.
Samples can be projected onto previously computed loadings using the -project-onto option: