QCTOOL v2

Computing PCs

QCTOOL computes PCs by first estimating a relatedness or 'kinship' matrix, and then forming the eigendecomposition. In short you use the -kinship option to compute a relatedness matrix, the -UDUT option to eigendecompose it, and the -PCs option to output PCs. A complete example would look like this:

$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv -PCs 20 -osample PCs.csv

This outputs the first 20 PCs to the file PCs.csv, in addition to the estimated kinship matrix and its eigendecomposition. The following sections show the use of these options in more detail.

Computing a kinship matrix

The -kinship option can be used to estimate a kinship matrix, as in:

$ qctool -g example.bgen -s example.sample -kinship kinship.csv

This outputs pairwise kinship values to the file kinship.csv, which is stored in a 'long' format with columns holding the first sample id, second sample id, the number of pairwise non-missing genotypes, and the estimated kinship value. (Only the upper triangle of this matrix is output).

More precisely, Suppose X is the L×N matrix of genotypes, with variants indexed by row. Let f_i be an estimate of the frequency of the ith variant. We write Z for the matrix X after centring and rescaling each row based on the allele frequency,

Z_i· = (X_i· - mean(X_i·)) / √ (2 f_i (1-f_i))

QCTOOL estimates the kinship matrix as 1/L Z^t Z. In forming Z, QCTOOL uses a posterior estimate of allele frequency f_i under a Beta(2,2) distribution, i.e. f_i = (1+N_b)/(2+2N)) where N_b is the count number of 'b' alleles in the data. This can be understood as implicitly adding a single haplotype of each allelic type to the data before computing the frequency, which in turn ensures that the frequency estimate is not zero or 1.

Eigendecomposing a kinship matrix and computing PCs

The -UDUT option can be used to compute a UDUT decomposition (i.e. an eigendecomposition) of the computed kinship matrix. E.g.

$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv

The output is an N×(N+1) matrix in which the first column represents the diagonal elements of D, i.e. the eigenvalues, and the following N columns are the right eigenvectors (i.e. the columns of U). To additionally output principal components (PCs), additionally add the -PCs option:

$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv -PCs 20 -osample PCs.csv

The argument is the number of PCs to output.

Note: the PCs computed are simply rescaled entries of the right eigenvectors; they are computed as PC_i = √(1/L) × U_·i D^-1/2. This scaling ensures the PCs do not grow with the number of variants.

Note: PCs are output to the file specified by -osample. Depending on the command line, other values might also be output to this file. For example, if you specify both -sample-stats and -PCs, the output file will contain both per-sample summary statistics and PCs. See the page on summary statistic file formats for more information on the format of the output.

In some contexts it may be preferable to load a previously computed kinship matrix, rather than to recompute a new one. This can be acheived with the -load-kinship option:

$ qctool -g example.bgen -s example.sample -load-kinship kinship.csv -UDUT udut.csv -PCs 20 -osample PCs.csv

Computing PC loadings

Having computed the UDUT decomposition, per-variant loadings can be computed using the -loadings option, e.g.:

$ qctool -g example.bgen -s example.sample -load-udut udut.csv -loadings loadings.csv

The -PCs option can again be used to adjust how many loadings are computed. Note: you should ensure the same set of variants is used to compute loadings as were used in constructing the kinship matrix.

Projecting additional samples onto a principal components decomposition

Given a set of loadings, additional samples can be projected onto the loadings using the -project-onto option:

$ qctool -g example.bgen -s example.sample -project-onto loadings.csv projection.sample

The first argument is a set of loadings to load, while the second argument is the output file to compute. Again, we advise using the same set of variants that were used to compute the PCs.

Computing principal components