qctool v2
A tool for quality control and analysis of gwas datasets.

## Computing principal components

Computing PCs
QCTOOL computes PCs by first estimating a relatedness or 'kinship' matrix, and then forming the eigendecomposition. In short you use the `-kinship` option to compute a relatedness matrix, the `-UDUT` option to eigendecompose it, and the `-PCs` option to output PCs. A complete example would look like this:
\$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv -PCs 20 -osample PCs.csv

This outputs the first 20 PCs to the file `PCs.csv`, in addition to the estimated kinship matrix and its eigendecomposition. The following sections show the use of these options in more detail.

Computing a kinship matrix
The `-kinship` option can be used to estimate a kinship matrix, as in:
\$ qctool -g example.bgen -s example.sample -kinship kinship.csv

This outputs pairwise kinship values to the file `kinship.csv`, which is stored in a 'long' format with columns holding the first sample id, second sample id, the number of pairwise non-missing genotypes, and the estimated kinship value. (Only the upper triangle of this matrix is output).

More precisely, Suppose X is the L×N matrix of genotypes, with variants indexed by row. Let fi be an estimate of the frequency of the ith variant. We write Z for the matrix X after centring and rescaling each row based on the allele frequency,

Z = (X - mean(X)) / √ (2 fi (1-fi))

QCTOOL estimates the kinship matrix as 1/L Z^t Z. In forming Z, QCTOOL uses a posterior estimate of allele frequency fi under a Beta(2,2) distribution, i.e. fi = (1+Nb)/(2+2N)) where Nb is the count number of 'b' alleles in the data. This can be understood as implicitly adding a single haplotype of each allelic type to the data before computing the frequency, which in turn ensures that the frequency estimate is not zero or 1.

Eigendecomposing a kinship matrix and computing PCs
The `-UDUT` option can be used to compute a UDUT decomposition (i.e. an eigendecomposition) of the computed kinship matrix. E.g.
\$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv
The output is an N×(N+1) matrix in which the first column represents the diagonal elements of D, i.e. the eigenvalues, and the following N columns are the right eigenvectors (i.e. the columns of U). To additionally output principal components (PCs), additionally add the `-PCs` option:
\$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv -PCs 20 -osample PCs.csv

The argument is the number of PCs to output.

Note: the PCs computed are simply rescaled entries of the right eigenvectors; they are computed as PCi = √(1/L) × U·i D-1/2. This scaling ensures the PCs do not grow with the number of variants.

Note: PCs are output to the file specified by `-osample`. Depending on the command line, other values might also be output to this file. For example, if you specify both `-sample-stats` and `-PCs`, the output file will contain both per-sample summary statistics and PCs. See the page on summary statistic file formats for more information on the format of the output.

In some contexts it may be preferable to load a previously computed kinship matrix, rather than to recompute a new one. This can be acheived with the `-load-kinship` option:

\$ qctool -g example.bgen -s example.sample -load-kinship kinship.csv -UDUT udut.csv -PCs 20 -osample PCs.csv
Having computed the UDUT decomposition, per-variant loadings can be computed using the `-loadings` option, e.g.:
The `-PCs` option can again be used to adjust how many loadings are computed. Note: you should ensure the same set of variants is used to compute loadings as were used in constructing the kinship matrix.
Given a set of loadings, additional samples can be projected onto the loadings using the `-project-onto` option: