Computing summary statistics
The -sample-stats option can be used to compute per-sample summary statistics. The output goes to a file specified by the -osample option. E.g., running QCTOOL with -sample-stats -osample sample-stats.txt computes per-sample summary statistics (average missingness and heterozygosity) and places them in the file sample-stats.txt. Additionally, if array intensity data is available (see processing intensity data), the average X channel intensity, Y channel intensity, total (X+Y) and difference (X-Y) of intensities will be computed.
These can be useful for QC purposes: for example, average intensity on the X and Y chromosomes can be used to directly determine sample gender.
Note: the output file can be formatted in various ways, controlled by the file extension. See the page on summary statistic file formats for information on output file formatting.
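To make the two per-sample quantities concrete, here is a minimal Python sketch of how missingness and heterozygosity could be computed for one sample. It is an illustration only, not QCTOOL's implementation: it assumes hard genotype calls encoded as 0, 1, 2 or None (missing), whereas QCTOOL may work from genotype probabilities, and the function and field names are hypothetical.

```python
from typing import Optional, Sequence

# Hypothetical encoding: each genotype is the count of the second allele
# (0, 1 or 2) at a diploid variant, or None when the call is missing.
Genotype = Optional[int]


def sample_stats(genotypes: Sequence[Genotype]) -> dict:
    """Return missingness and heterozygosity for one sample's genotypes."""
    total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    missing = total - len(called)
    heterozygous = sum(1 for g in called if g == 1)
    return {
        # Proportion of variants with no call for this sample.
        "missing_proportion": missing / total if total else float("nan"),
        # Assumption: heterozygosity is taken over non-missing calls only.
        "heterozygous_proportion": (
            heterozygous / len(called) if called else float("nan")
        ),
    }


stats = sample_stats([0, 1, None, 2, 1])
print(stats)
```

Samples with unusually high missingness or outlying heterozygosity are the ones typically flagged for a closer look during QC.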
Per-variant summary statistics are computed using the -snp-stats option. This will compute genotype counts, allele counts and frequencies, missing data rates, info metrics, and a P-value against the null that genotypes are in Hardy-Weinberg proportions in diploid samples. Output is sent to the file specified in the -osnp option. See the page on summary statistic file formats for information on output file formatting.
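The per-variant quantities can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not QCTOOL's code: it uses hard calls (0, 1, 2 or None) rather than genotype probabilities, omits the info metrics, and tests Hardy-Weinberg proportions with a one-degree-of-freedom chi-square approximation, which may differ from the test QCTOOL applies. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Optional, Sequence


def snp_stats(genotypes: Sequence[Optional[int]]) -> dict:
    """Summarise one diploid variant from hard calls (0, 1, 2 or None)."""
    counts = Counter(g for g in genotypes if g is not None)
    n_aa, n_ab, n_bb = counts[0], counts[1], counts[2]
    n = n_aa + n_ab + n_bb
    a_count = 2 * n_aa + n_ab
    b_count = 2 * n_bb + n_ab
    b_freq = b_count / (2 * n) if n else float("nan")

    # Chi-square goodness-of-fit test against Hardy-Weinberg proportions.
    # Left undefined (NaN) when there are no calls or the variant is monomorphic.
    hwe_p = float("nan")
    if n and 0 < b_freq < 1:
        a_freq = 1 - b_freq
        expected = (n * a_freq**2, 2 * n * a_freq * b_freq, n * b_freq**2)
        observed = (n_aa, n_ab, n_bb)
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        # Survival function of a chi-square variable with 1 degree of freedom.
        hwe_p = math.erfc(math.sqrt(chi2 / 2))

    total = len(genotypes)
    return {
        "genotype_counts": (n_aa, n_ab, n_bb),
        "allele_counts": (a_count, b_count),
        "b_allele_frequency": b_freq,
        "missing_proportion": (total - n) / total if total else float("nan"),
        "hwe_p_value": hwe_p,
    }


in_hwe = snp_stats([0] * 25 + [1] * 50 + [2] * 25 + [None] * 5)
no_hets = snp_stats([0] * 50 + [2] * 50)
print(in_hwe)
print(no_hets["hwe_p_value"])
```

The first variant matches Hardy-Weinberg expectations exactly, while the second has no heterozygotes at all and so yields a very small P-value.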
Analysis on the sex chromosomes is complicated by the fact that males and females have differing ploidy.
To process sex chromosomes correctly, QCTOOL relies on the ploidy being correct in the input genotype files.
However, some data sets (and some file formats) instead encode males as diploid homozygotes.
The -infer-ploidy-from option can be used to deal with such data; see the page on inferring ploidy.
For sex chromosomes, QCTOOL outputs both diploid and haploid genotype counts, as well as an appropriate allele frequency, a sex-chromosome-specific info metric, and a test for difference in frequency between males and females.
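As a rough illustration of why ploidy matters here, the following Python sketch counts haploid (male) and diploid (female) genotypes at an X chromosome variant and derives an allele frequency that weights each sample by its number of chromosomes. It assumes hard calls with males encoded as 0 or 1 and females as 0, 1 or 2; the function and field names are hypothetical, and the info metric and the male/female frequency test are not reproduced.

```python
from collections import Counter
from typing import Optional, Sequence


def x_chromosome_counts(
    male_calls: Sequence[Optional[int]],
    female_calls: Sequence[Optional[int]],
) -> dict:
    """Count haploid and diploid genotypes and combine them into one frequency."""
    male = Counter(g for g in male_calls if g is not None)
    female = Counter(g for g in female_calls if g is not None)

    # Each called male contributes one chromosome, each called female two.
    male_chromosomes = male[0] + male[1]
    female_chromosomes = 2 * (female[0] + female[1] + female[2])
    male_b = male[1]
    female_b = female[1] + 2 * female[2]
    total = male_chromosomes + female_chromosomes
    nan = float("nan")

    return {
        "haploid_counts": {"A": male[0], "B": male[1]},
        "diploid_counts": {"AA": female[0], "AB": female[1], "BB": female[2]},
        "b_allele_frequency": (male_b + female_b) / total if total else nan,
        "male_b_frequency": male_b / male_chromosomes if male_chromosomes else nan,
        "female_b_frequency": (
            female_b / female_chromosomes if female_chromosomes else nan
        ),
    }


result = x_chromosome_counts([0, 1, 1, None], [0, 1, 2, 2])
print(result)
```

If males were instead encoded as diploid homozygotes, each male B allele would be counted twice, which is the distortion that correct ploidy information avoids.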
The -differential option can be used to compare levels of missingness between samples having different levels of a covariate in the sample file.
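One way to think about differential missingness is as a contingency table of covariate level against call status. The Python sketch below builds that table for a single variant and computes a Pearson chi-square statistic with its degrees of freedom. This is an assumption made for illustration: the test QCTOOL actually uses is not described here, and the function names and the case/control labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def missingness_by_level(
    genotypes: Sequence[Optional[int]],
    covariate: Sequence[Hashable],
) -> Dict[Hashable, List[int]]:
    """Tabulate [missing, called] counts for each covariate level."""
    table: Dict[Hashable, List[int]] = defaultdict(lambda: [0, 0])
    for genotype, level in zip(genotypes, covariate):
        table[level][0 if genotype is None else 1] += 1
    return dict(table)


def pearson_chi_square(table: Dict[Hashable, List[int]]) -> Tuple[float, int]:
    """Return the Pearson chi-square statistic and degrees of freedom."""
    rows = list(table.values())
    total = sum(missing + called for missing, called in rows)
    column_totals = [sum(r[0] for r in rows), sum(r[1] for r in rows)]
    statistic = 0.0
    for row in rows:
        row_total = row[0] + row[1]
        for j in (0, 1):
            expected = row_total * column_totals[j] / total
            if expected > 0:  # skip empty cells to avoid division by zero
                statistic += (row[j] - expected) ** 2 / expected
    return statistic, len(rows) - 1


# Ten cases (four missing calls) followed by ten controls (none missing).
genotypes = [None] * 4 + [0] * 6 + [1] * 10
covariate = ["case"] * 10 + ["control"] * 10
table = missingness_by_level(genotypes, covariate)
statistic, degrees_of_freedom = pearson_chi_square(table)
print(table, statistic, degrees_of_freedom)
```

A variant whose missingness differs strongly between cases and controls can produce spurious association signals, which is why this comparison is a common QC step.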
The -stratify option can be used to compute summary statistics stratified over subsets of the data. The argument must be the name of a column in the sample file containing discrete values (i.e. it must be of type B or D). Summary statistics will be computed for each subset of samples having the same value in that column. The output will contain the same fields as for -snp-stats, but each column will appear multiple times with a suffix of the form [<column>=<value>] to denote which stratum the values are computed for.
This feature has several possible use cases: for example, it can be used to compute allele counts across ethnic groups in a sample of mixed ancestry, or to inspect deviation from Hardy-Weinberg equilibrium separately in disease cases and controls.
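The stratified output layout can be illustrated with a short Python sketch that computes allele counts per stratum and labels each field with the [<column>=<value>] suffix described above. It assumes hard calls (0, 1, 2 or None); the base field names A_count and B_count, the column name population, and its values are hypothetical and chosen only for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def stratified_allele_counts(
    genotypes: Sequence[Optional[int]],
    strata: Sequence[str],
    column: str,
) -> Dict[str, int]:
    """Compute allele counts per stratum, keyed by suffixed field names."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for genotype, value in zip(genotypes, strata):
        if genotype is None:
            continue  # missing calls contribute no alleles
        counts[value][0] += 2 - genotype  # copies of the first allele
        counts[value][1] += genotype      # copies of the second allele

    output: Dict[str, int] = {}
    for value in sorted(counts):
        suffix = f"[{column}={value}]"
        output[f"A_count{suffix}"] = counts[value][0]
        output[f"B_count{suffix}"] = counts[value][1]
    return output


fields = stratified_allele_counts(
    genotypes=[0, 1, 2, None, 1, 1],
    strata=["EUR", "EUR", "EUR", "AFR", "AFR", "AFR"],
    column="population",
)
for name, value in fields.items():
    print(name, value)
```

Each statistic therefore appears once per stratum, so the number of output columns grows with the number of distinct values in the chosen column.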