qctool

Important! This page documents the v1 release series of QCTOOL, which is no longer maintained. For up-to-date versions suitable for use with UK Biobank data, please see the QCTOOL v2 page.

QCTOOL is a command-line utility program for basic quality control of GWAS datasets. It supports the same file formats used by the WTCCC studies, as well as the binary file format described here and the Variant Call Format, and is designed to work seamlessly with SNPTEST and related tools. A typical use of QCTOOL is to compute per-sample and per-SNP summary statistics for a cohort and use these to filter out samples and SNPs (either by removing them from the files or by writing exclusion lists). QCTOOL can also perform various subsetting and merging operations, and manipulate sample information in preparation for association testing, as shown on the examples page.

QCTOOL is designed to be as easy-to-use as possible and we hope you find it so.

Note: The program GTOOL, by Colin Freeman, supports a similar but slightly different set of conversion, merging and subsetting operations.

Related tools. You may be interested in these other programs, also built from the same codebase:

inthinnerator - a command-line tool for thinning SNPs based on physical or recombination distance. Available here.

Change history. QCTOOL has undergone significant changes since the original release (webpage here). This page documents version 1.4 of QCTOOL. A short summary of changes is:

In v1.4:

Fix bug with gzipped output.
Fix crashing issue with Hardy-Weinberg P-value computation for large sample sizes.

In v1.3:

A new option pair -[in|ex]cl-snps has been added. Each option reads a list of SNPs in the format output by the -write-snp-excl-list option: a file with six named columns (SNPID, rsid, chromosome, position, alleleA, alleleB).
A bug preventing the -incl-snpids option from working correctly was fixed.

In v1.2:

A new option pair -[in|ex]cl-positions has been added. Each takes a file which should contain a list of genomic positions in the form <chromosome>:<position>, separated by whitespace.
Support for bgzipped files (created using the bgzip tool from SAMtools) has been added. As with gzipped files, these are automatically detected based on the .gz filename extension.
The -sort option now works with unzipped VCF format output as described on the examples page.
A bug relating to VCF files with more than 10,000 lines has been fixed.
The behaviour of the -sort option has been tweaked to make it behave better on compute clusters.

In v1.1:

Support for Variant Call Format has been added - see the file formats page.
New options -[in|ex]cl-range, -[in|ex]cl-rsids, -[in|ex]cl-snpids, -[in|ex]cl-snps-matching for SNP filtering have been added. Options -interval and -snp-[in|ex]cl-list have been removed.
QCTOOL can now work with multiple cohorts, treating them like one big cohort.
The BGEN format now supports long alleles such as indels and deletions. There is an updated spec.
A -sort option has been added. This sorts the output by chromosome and position.
The names and usage of some options have been rationalised.

See the file CHANGELOG.txt for a full list of changes.

Acknowledgements. The following people contributed to the design and implementation of qctool:

In addition, QCTOOL contains the SNP-HWE code by Jan Wigginton et al., described in "A Note on Exact Tests of Hardy-Weinberg Equilibrium", Wigginton et al, Am. J. Hum. Genet (2005) 76:887-93.

Contact. For more information or questions, please contact the oxstatgen mailing list at

                        oxstatgen (at) jiscmail.ac.uk

QCTOOL was designed for use in a pipeline for QC of genotype data, the typical general structure of which is shown on the right. A detailed list of options is given by the command

$ qctool -help

which produces this output.

QCTOOL works with the following per-sample summary statistics, calculated using the -sample-stats option:

Missing data proportion: the total proportion of missing genotype data for this sample across all SNPs. This is one minus the sum of the three genotype probabilities for the sample, averaged across all SNPs. A large missing data proportion might be due, for example, to a badly-prepared sample. You can filter on missingness using the -sample-missing-rate option.
Heterozygosity: This is the sum of heterozygote call probabilities across all SNPs divided by the total number of SNPs. A high value of heterozygosity might indicate, for example, that the DNA from this sample has been accidentally mixed with another sample during processing; a low value might indicate a higher degree of relatedness than expected among the ancestors of the individual. You can filter on heterozygosity using the -heterozygosity option.
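
The two per-sample statistics can be sketched in Python as follows. This is an illustration only, not QCTOOL's implementation: the function name sample_stats and the data layout (one (AA, AB, BB) probability triple per SNP) are assumptions, and missingness at a SNP is taken to be one minus the sum of the three genotype probabilities.

```python
def sample_stats(genotypes):
    """Return (missing proportion, heterozygosity) for one sample.

    `genotypes` holds one (P(AA), P(AB), P(BB)) triple per SNP
    (a hypothetical in-memory layout chosen for this sketch).
    """
    n_snps = len(genotypes)
    if n_snps == 0:
        raise ValueError("need at least one SNP")
    # Missing mass at a SNP is the probability not assigned to any genotype.
    missing = sum(1.0 - (aa + ab + bb) for aa, ab, bb in genotypes) / n_snps
    # Heterozygosity is the average heterozygote call probability.
    heterozygosity = sum(ab for _, ab, _ in genotypes) / n_snps
    return missing, heterozygosity
```

A sample with one entirely missing SNP out of four would get a missing proportion of 0.25 under this definition.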

You can filter samples based on these summary statistics, or using lists of sample ids.

QCTOOL works with the following per-SNP summary statistics, computed using the -snp-stats option:

Missing data proportion: The proportion of missing genotype data (null genotype call probabilities) across all samples for the SNP. A high value indicates that the SNP is not well called. You can filter these SNPs out using the -snp-missing-rate option.
Missing call proportion: The proportion of individuals for which the maximum genotype probability is less than a threshold of 0.9. You can filter these SNPs out using the -snp-missing-call-rate option.
Minor allele frequency: The estimated frequency of the less common allele. The -maf option can be used to retain only SNPs within a given range of minor allele frequencies.
HWE: This is a -log10 P-value for Hardy-Weinberg equilibrium, computed using the SNP-HWE code by Wigginton et al. The -hwe option can be used to filter out SNPs that are out of equilibrium.
Info: This is IMPUTE's info measure, defined as one minus the average over individuals of the variance of a genotype (given the individual's genotype call probability distribution), divided by the variance if only the allele frequency were known. It measures how much uncertainty there is in the genotype calls. It equals 0 when the genotype call probabilities are those implied by the allele frequency alone, and 1 when the calls are all certain. Use the -info option to filter on this statistic.
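
As a rough illustration of these definitions (HWE excepted), the following Python sketch computes the per-SNP statistics from genotype probabilities. It is not QCTOOL's code: the function name snp_stats, the dictionary keys, the treatment of missing data in the frequency estimate, and the convention of returning info = 1 for a monomorphic SNP are all assumptions made for the example.

```python
def snp_stats(probs, call_threshold=0.9):
    """Summarise one SNP from per-sample (P(AA), P(AB), P(BB)) triples."""
    n = len(probs)
    if n == 0:
        raise ValueError("need at least one sample")
    totals = [aa + ab + bb for aa, ab, bb in probs]
    stats = {
        # Average probability mass not assigned to any genotype.
        "missing": sum(1.0 - t for t in totals) / n,
        # Fraction of samples whose best call falls below the threshold.
        "missing_calls": sum(1 for p in probs if max(p) < call_threshold) / n,
        "maf": float("nan"),
        "info": float("nan"),
    }
    called = sum(totals)
    if called == 0:
        return stats  # no data: frequency and info are left undefined

    # Expected B-allele count per sample (e) and its expected square (f).
    e = [ab + 2.0 * bb for _, ab, bb in probs]
    f = [ab + 4.0 * bb for _, ab, bb in probs]
    theta = sum(e) / (2.0 * called)  # estimated B-allele frequency
    stats["maf"] = min(theta, 1.0 - theta)

    if 0.0 < theta < 1.0:
        # Mean per-sample genotype variance, relative to the variance
        # 2 * theta * (1 - theta) implied by the allele frequency alone.
        mean_var = sum(fi - ei * ei for ei, fi in zip(e, f)) / n
        stats["info"] = 1.0 - mean_var / (2.0 * theta * (1.0 - theta))
    else:
        stats["info"] = 1.0  # monomorphic SNP: assumed convention
    return stats
```

With every sample set to (0.25, 0.5, 0.25), the probabilities carry no more information than an allele frequency of 0.5 and the sketch returns info = 0; with all calls certain it returns info = 1.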

You can filter out SNPs based on these summary statistics, on chromosome and position, or based on the SNP ids.

In general, qctool tries to warn you if it thinks you are doing something wrong. In these cases you can override qctool using the -force option.

For more information, see the usage examples.

This page shows command lines used to carry out common tasks with QCTOOL. We assume the program is being run from a directory containing a sample file example.sample and 22 GEN files named example_01.gen, example_02.gen, etc. As one of the first examples, we convert these files to BGEN format and use this in subsequent examples. See also the list of options.

View the program usage page

$ qctool -help | less

Convert GEN file(s) to other formats:

$ qctool -g example_#.gen -og example.bgen

(This makes a BGEN file containing all the variants.)

$ qctool -g example_#.gen -og example.vcf

(This makes a VCF file containing all the variants.)

$ qctool -g example_#.gen -og example_#.bgen

(This makes one output file per input file.)

$ qctool -g example_01.gen -og example_01.bgen -assume-chromosome 01

(This converts just one file, filling in the chromosome information.)

$ qctool -g example_01.bgen -og example_01.gen -omit-chromosome

(This writes GEN-format files without a chromosome column. These are suitable for use with other programs such as GTOOL and IMPUTE.)

Note: see file formats for a description of file formats understood by QCTOOL.
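
The optional chromosome column in GEN files can be recognised by counting columns, since a GEN line otherwise has five identifying fields followed by three probabilities per sample. The Python sketch below shows one way this could work; the function name split_gen_line and the return layout are hypothetical, and QCTOOL's own parser may differ in detail.

```python
def split_gen_line(line):
    """Split one GEN line into (chromosome, identifying fields, triples).

    The chromosome is None when the line has no leading chromosome column.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("a GEN line needs at least five columns")
    extra = (len(fields) - 5) % 3
    if extra == 0:
        chromosome = None            # 5 + 3N columns: no chromosome column
    elif extra == 1:
        chromosome = fields.pop(0)   # 6 + 3N columns: leading chromosome
    else:
        raise ValueError("unexpected number of columns: %d" % len(fields))
    snpid, rsid, position, allele_a, allele_b = fields[:5]
    values = [float(x) for x in fields[5:]]
    # Group the remaining values into one (AA, AB, BB) triple per sample.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return chromosome, (snpid, rsid, int(position), allele_a, allele_b), triples
```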

Convert VCF file(s) to other formats:

$ qctool -g example.vcf -og converted.bgen

This reads genotype calls from the GT field in the VCF file. Suitable metadata must be supplied.

$ qctool -g example.vcf -vcf-genotype-field my_field -og converted.bgen

This reads genotype probabilities from the my_field field in the VCF file. Suitable metadata must be supplied.

Send output to a pipe:

$ qctool -g example.bgen -og - | less -S

Note: currently this outputs genotypes in GEN format.

Sort a file into chromosome/position order

$ qctool -g example.bgen -sort -og sorted.bgen

Note: currently sorting is supported for unzipped GEN, unzipped VCF, and BGEN format output files.

Subset SNPs

$ qctool -g example.bgen -og subsetted.gen -incl-rsids rsids_to_include.txt

$ qctool -g example.bgen -og subsetted.gen -excl-positions positions_to_exclude.txt

The file passed to -excl-positions should contain a whitespace-separated list of positions in the form <chromosome>:<position>.
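
For illustration, a positions file in this form could be read with a few lines of Python. The function name read_positions is hypothetical, and keeping the chromosome as an unmodified string (so that "06" and "6" are distinct) is an assumption of this sketch rather than documented QCTOOL behaviour.

```python
def read_positions(text):
    """Parse whitespace-separated <chromosome>:<position> tokens into a set."""
    positions = set()
    for token in text.split():
        chromosome, sep, position = token.partition(":")
        if not sep or not chromosome or not position.isdigit():
            raise ValueError("malformed position: %r" % token)
        positions.add((chromosome, int(position)))
    return positions
```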

$ qctool -g example.bgen -og subsetted.gen -excl-range 06:25000000-40000000

$ qctool -g example.bgen -og subsetted.gen -snp-missing-rate 0.05 -maf 0.01 1 -info 0.9 1 -hwe 20

Subset SNPs, writing an exclusion list

$ qctool -g example.bgen -write-snp-excl-list snp_exclusions.txt -snp-missing-rate 0.05

Compute sample summary statistics:

$ qctool -g example.bgen -s example.sample -sample-stats example.sample-stats -os example.sample

Note: combine this with SNP subset options to use only a subset of SNPs in the computation. The existing sample file will be backed up to example.sample~1.

Filter out samples

$ qctool -g example.bgen -s example.sample -og filtered.bgen -excl-samples samples_to_exclude.txt

$ qctool -g example.bgen -s example.sample -og filtered.bgen -sample-missing-rate 0.1 -heterozygosity 0.2 0.3

Note: you must first use -sample-stats to populate the missing and heterozygosity columns in the sample file (or populate them in some other way).

$ qctool -g example.bgen -s example.sample -sample-missing-rate 0.1 -heterozygosity 0.2 0.3 -write-sample-excl-list my_excluded_samples.txt

Compute per-SNP summary statistics

$ qctool -g example.bgen -snp-stats example.snp-stats

Note: this computes SNP summary statistics using all samples. To use a subset of samples, use for example:

$ qctool -g example.bgen -s example.sample -snp-stats example.snp-stats -excl-samples samples_to_exclude.txt

$ qctool -g example.bgen -s example.sample -snp-stats example.snp-stats -sample-missing-rate 0.1 -heterozygosity 0.2 0.3

Combine two or more datasets into one large one

$ qctool -g example.bgen -s example.sample -g second_cohort_#.gen -s second_cohort.sample -og joined.gen -os joined.sample

Note: This command matches SNPs by the fields specified by the -snp-match-fields option, which defaults to matching by genomic position, rsid, SNPID, and alleles. It also assumes variants are sorted by these fields. Currently, variants are dropped from the resulting output if they do not match across all the input cohorts.

The output file will have N = ∑_i N_i samples, where N_i is the number of samples in the ith input cohort. When writing the sample files, QCTOOL attempts to merge the columns of the sample files based on name and type. Columns are dropped if they have the same name but different types. Otherwise, each column in the input sample files appears in the output sample file, possibly filling in missing values for those cohorts for which that column is not present.
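
The column-merging rule described above can be sketched in Python as follows. This only illustrates the stated rule: the function name merge_sample_files, the in-memory representation (a dictionary of column types plus a list of row dictionaries per cohort), and the "NA" missing-value marker are assumptions made for the example.

```python
def merge_sample_files(cohorts, missing="NA"):
    """Merge sample data from several cohorts.

    `cohorts` is a list of (column_types, rows) pairs, where column_types
    maps column name -> type code and rows is a list of dicts.
    """
    merged_types = {}
    dropped = set()
    for column_types, _ in cohorts:
        for name, type_code in column_types.items():
            # Same name with a different type anywhere: drop the column.
            if name in merged_types and merged_types[name] != type_code:
                dropped.add(name)
            merged_types.setdefault(name, type_code)
    kept = [name for name in merged_types if name not in dropped]

    rows = []
    for _, cohort_rows in cohorts:
        for row in cohort_rows:
            # Fill in a missing value where a cohort lacks the column.
            rows.append({name: row.get(name, missing) for name in kept})
    return {name: merged_types[name] for name in kept}, rows
```

The number of output rows is the sum of the cohort sizes, matching N = ∑_i N_i.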

$ qctool -g example.bgen -s example.sample -g second_cohort_#.gen -s second_cohort.sample -og joined.gen -os joined.sample -snp-match-fields position,alleles

Note: This matches SNPs by position and alleles only. This option is useful in situations where different cohorts have different SNPID or rsid fields - for example, when merging datasets that have been typed on different microarray platforms, or datasets that have been imputed separately.

$ qctool -g example.bgen -s example.sample -g second_cohort_#.gen -s second_cohort.sample -og joined.gen -os joined.sample -snp-match-fields position,alleles -match-alleles-to-cohort1

Note: This again matches by position and alleles. However, if the alleles in the second (and subsequent) cohorts are the same as those in the first cohort, but coded the other way round (e.g. cohort1 = A G, cohort2 = G A), then the alleles are flipped accordingly.

Note that this option will generally only be useful if the alleles in all cohorts are represented with respect to the same reference strand, e.g. the forward strand. It does not adjust strand alignment.
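
A minimal Python sketch of the allele recoding described above is given here. The function name match_alleles_to_first and the data layout are hypothetical; the point it illustrates is that swapping the two alleles also reverses the order of the three genotype probabilities, while no strand (complement) flip is attempted.

```python
def match_alleles_to_first(first_alleles, alleles, probs):
    """Recode a later cohort's SNP to the first cohort's allele order.

    `first_alleles` and `alleles` are (alleleA, alleleB) tuples and `probs`
    is a list of (P(AA), P(AB), P(BB)) triples for the later cohort.
    Returns (alleles, probs), or None when the alleles do not match.
    """
    if alleles == first_alleles:
        return alleles, probs
    if alleles == (first_alleles[1], first_alleles[0]):
        # Alleles coded the other way round: AA and BB swap places.
        flipped = [(bb, ab, aa) for aa, ab, bb in probs]
        return first_alleles, flipped
    return None  # different alleles; strand is deliberately not adjusted
```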

Add genotype dosages for a given SNP or SNPs to the sample file

$ qctool -g example_#.gen -s example.sample -os example.sample -condition-on rs1234

This adds a column with name rs1234:additive_dosage, containing the additive dosage from the SNP, to the sample file. You can also select SNPs by position:

$ qctool -g example_#.gen -s example.sample -os example.sample -condition-on pos~03:10001

It is also possible to select dominant, recessive, or heterozygote dosages from the SNP:

$ qctool -g example_#.gen -s example.sample -os example.sample -condition-on "pos~03:10001(add|dom|het|rec)"

Note: these options behave in the same way as SNPTEST's -condition_on option.
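
The four dosage types can be written down directly from the genotype probabilities. The Python sketch below assumes that dosages count the second allele (alleleB), which is an assumption of this example rather than something stated on this page; the names DOSAGE_MODELS and dosages are hypothetical.

```python
# Each model maps a (P(AA), P(AB), P(BB)) triple to a single dosage value.
DOSAGE_MODELS = {
    "add": lambda aa, ab, bb: ab + 2.0 * bb,  # expected number of B alleles
    "dom": lambda aa, ab, bb: ab + bb,        # probability of at least one B
    "het": lambda aa, ab, bb: ab,             # probability of exactly one B
    "rec": lambda aa, ab, bb: bb,             # probability of two B alleles
}


def dosages(probs, model="add"):
    """Return one dosage per sample under the named model."""
    try:
        compute = DOSAGE_MODELS[model]
    except KeyError:
        raise ValueError("unknown model: %r" % model)
    return [compute(aa, ab, bb) for aa, ab, bb in probs]
```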

Quantile-normalise columns of the sample file

$ qctool -g example.bgen -s example.sample -os example.sample -quantile-normalise column1

This adds a column with name column1:quantile-normalised to the sample file. The specified columns must be continuous, i.e. of type 'P' or 'C'. Note that even though the SNPs are not used here, currently you must still specify the -g option.

$ qctool -g example.bgen -s example.sample -os example.sample -quantile-normalise column1,column2

Note: this option behaves in the same way as SNPTEST's -normalise option.
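
To give a feel for what quantile normalisation does, here is a Python sketch of one common variant: values are ranked and each rank is mapped to the corresponding quantile of a standard normal distribution. The exact procedure used by QCTOOL and SNPTEST is not described on this page, so the rank offset of 0.5, the averaging of tied ranks, and the handling of missing values (None, left untouched) are all assumptions. It requires Python 3.8 or later for statistics.NormalDist.

```python
from statistics import NormalDist


def quantile_normalise(values):
    """Map non-missing values to standard normal quantiles by rank."""
    present = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(present)
    result = [None] * len(values)
    normal = NormalDist()
    start = 0
    while start < n:
        # Find the run of tied values beginning at `start`.
        end = start
        while end + 1 < n and present[end + 1][0] == present[start][0]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0  # 1-based average rank
        z = normal.inv_cdf((mean_rank - 0.5) / n)
        for _, index in present[start:end + 1]:
            result[index] = z
        start = end + 1
    return result
```

The output keeps the order of the input, so it can be written back as a new column alongside the original.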

QCTOOL supports the following file formats for genotype data:

format	recognised extension(s)	notes
GEN format	.gen, .gen.gz	Optionally, an extra initial column containing chromosomes can be included in the input. QCTOOL auto-detects this by counting the columns in the file. To suppress this column in output files, use the -omit-chromosome option.
BGEN format	.bgen	Output files are in BGEN v1.1. QCTOOL can still read v1.0 of the BGEN spec.
Variant call format	.vcf, .vcf.gz	QCTOOL is strict about metadata in input files for the fields it reads; since file metadata is not always correct, a -metadata option is provided to override it. Currently, only genotypes are written when outputting VCF files. Note: currently, QCTOOL does not apply PHRED scaling to probabilities in the GP field (this is in violation of the spec).
IMPUTE haplotype format	(none)	Input only. To specify this filetype, use -g <haplotypes file> -filetype impute_haplotypes. It is assumed the legend file name is the same as the haplotypes file name, minus extension, with .legend appended. Genotypes are formed from pairs of haplotypes; it is assumed that the two haplotypes for each individual are consecutive columns in the haplotypes file.
SHAPEIT haplotype format	(none)	Input only. Use -filetype shapeit_haplotypes to specify this file type. Genotypes are formed from pairs of haplotypes; it is assumed that the two haplotypes for each individual are consecutive columns in the file. (This format is described here.)
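
The two conventions described for haplotype input (deriving the legend file name, and forming genotypes from consecutive pairs of haplotype columns) can be sketched in Python as below. The function names are hypothetical, haplotype alleles are assumed to be coded 0 or 1, and only the final filename extension is stripped when building the legend name.

```python
import os


def legend_path(haplotypes_path):
    """Replace the haplotype file's final extension with .legend."""
    root, _ = os.path.splitext(haplotypes_path)
    return root + ".legend"


def genotypes_from_haplotypes(line):
    """Turn one line of 0/1 haplotype alleles into genotype triples.

    Columns 2k and 2k+1 are taken to be the two haplotypes of sample k.
    """
    alleles = [int(x) for x in line.split()]
    if len(alleles) % 2 != 0:
        raise ValueError("expected an even number of haplotype columns")
    if any(a not in (0, 1) for a in alleles):
        raise ValueError("haplotype alleles must be 0 or 1")
    genotypes = []
    for i in range(0, len(alleles), 2):
        count = alleles[i] + alleles[i + 1]  # copies of the second allele
        probs = [0.0, 0.0, 0.0]
        probs[count] = 1.0                   # phased data gives a hard call
        genotypes.append(tuple(probs))
    return genotypes
```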

QCTOOL is available either as binaries or as source code.

Binaries

Pre-compiled binaries are available for the following platforms.

Version	Platform	File
v1.4	Linux x86-64 static build	qctool_v1.4-linux-x86_64.tgz (942Kb)
v1.4	Linux alternative build	qctool_v1.4-scientific-linux-x86_64.tgz (972Kb)
v1.4	Mac OS X	qctool_v1.4-osx.tgz (1.1Mb)

v1.3	Linux x86-64 static build	qctool_v1.3-linux-x86_64.tgz (1.7Mb)
v1.3	Linux alternative build	qctool_v1.3-scientific-linux-x86_64.tgz (983Kb)
v1.3	Mac OS X	qctool_v1.3-osx.tgz (1021Kb)

v1.2*	Linux x86-64 static build	qctool_v1.2-linux-x86_64.tgz (1.7Mb)
v1.2*	Mac OS X	qctool_v1.2-osx.tgz (1021Kb)

v1.1*	Linux x86-64 static build	qctool_v1.1-linux-x86_64.tgz (1.6Mb)
v1.1*	Mac OS X	qctool_v1.1-osx.tgz (1014Kb)

v1.0*	Linux x86-64 static build	qctool_v1.0-static-linux-x86-64.bz2 (1001Kb)
v1.0*	Mac OS X 10.6.3	qctool_v1.0-static-osx-10.6.3.bz2 (431Kb)

* Older versions are preserved for download here but are unsupported.

To run qctool, download the relevant file and extract it as follows.

$ tar -xzf qctool_v1.x-[machine].tgz
$ cd qctool_v1.x-[machine]
$ ./qctool -help

Source

The source code for qctool is available as a Mercurial repository hosted on Bitbucket. Assuming you have Mercurial installed, a basic download and compilation sequence (for the currently released version) would be:

    $ hg clone --rev qctool-release https://gavinband@bitbucket.org/gavinband/qctool
    destination directory: qctool
    requesting all changes
    adding changesets
    adding manifests
    adding file changes
    added 975 changesets with 5890 changes to 2022 files
    updating to branch qctool-release
    1821 files updated, 0 files merged, 0 files removed, 0 files unresolved
    $ cd qctool
    $ ./waf-1.5.18 configure
    $ ./waf-1.5.18

This produces an executable

./build/release/qctool-release

You will need boost and zlib installed. More detailed build instructions can be found on the QCTOOL wiki.