Important! This page documents the v1 release series of QCTOOL, which is no longer maintained. For up-to-date versions suitable for use with UK Biobank data, please see the QCTOOL v2 page.
QCTOOL is a command-line utility program for basic quality control of GWAS datasets. It supports the same file formats used by the WTCCC studies, as well as the binary file format described here and the Variant Call Format, and is designed to work seamlessly with SNPTEST and related tools. A typical use of QCTOOL is to compute per-sample and per-SNP summary statistics for a cohort, and use these to filter out samples and SNPs (either by removing them from the files or by writing exclusion lists). QCTOOL can also be used to perform various subsetting and merging operations, and to manipulate sample information in preparation for association testing, as shown on the examples page.
QCTOOL is designed to be as easy-to-use as possible and we hope you find it so.
Note: The program GTOOL, by Colin Freeman, supports a similar but slightly different set of conversion, merging and subsetting operations.
Related tools. You may be interested in these other programs, also built from the same codebase:
- inthinnerator - a command-line tool for thinning SNPs based on physical or recombination distance. Available here.
Change history. QCTOOL has undergone significant changes since the original release (webpage here). This page documents version 1.4 of QCTOOL. A short summary of changes is:
In v1.4:
- Fix bug with gzipped output.
- Fix crashing issue with Hardy-Weinberg P-value computation for large sample sizes.
In v1.3:
- A new option pair -[in|ex]cl-snps has been added. This reads a list of SNPs in the format output by the -write-snp-excl-list option. This is a file with 6 named columns, SNPID, rsid, chromosome, position, alleleA, alleleB.
- A bug preventing the -incl-snpids option from working correctly was fixed.
In v1.2:
- A new option pair -[in|ex]cl-positions has been added. Each takes a file which should contain a list of genomic positions in the form <chromosome>:<position>, separated by whitespace.
- Support for bgzipped files (created using the bgzip tool from SAMtools) has been added. As for gzipped files these are automatically detected based on the .gz filename extension.
- The -sort option now works with unzipped VCF format output as described on the examples page.
- A bug relating to VCF files with more than 10,000 lines has been fixed.
- The behaviour of the -sort option has been tweaked to make it behave better on compute clusters.
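The position list format accepted by -[in|ex]cl-positions (entries of the form <chromosome>:<position>, separated by whitespace) can be illustrated with a short Python sketch. The function name and the chromosome labels in the example are hypothetical, chosen only for illustration; this is not QCTOOL's own parser.

```python
def parse_positions(text):
    """Parse whitespace-separated <chromosome>:<position> entries."""
    positions = []
    for token in text.split():
        # Split each entry at the first colon.
        chromosome, _, position = token.partition(":")
        if not chromosome or not position.isdigit():
            raise ValueError(f"malformed entry: {token!r}")
        positions.append((chromosome, int(position)))
    return positions


# Entries may be separated by spaces, tabs or newlines.
print(parse_positions("01:12345 02:67890\nX:555"))
```

Rejecting malformed entries outright, rather than skipping them, is a design choice of this sketch: a silently ignored typo in a filter list is hard to spot later.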
In v1.1:
- Support for Variant Call Format has been added - see the file formats page.
- New options -[in|ex]cl-range, -[in|ex]cl-rsids, -[in|ex]cl-snpids, -[in|ex]cl-snps-matching for SNP filtering have been added. Options -interval and -snp-[in|ex]cl-list have been removed.
- QCTOOL can now work with multiple cohorts, treating them like one big cohort.
- The BGEN format now supports long alleles such as indels and deletions. There is an updated spec.
- A -sort option has been added. This sorts the output by chromosome and position.
- The names and usage of some options have been rationalised.
See the file CHANGELOG.txt for a full list of changes.
Acknowledgements. The following people contributed to the design and implementation of qctool:
In addition, QCTOOL contains the SNP-HWE code by Jan Wigginton et al., described in "A Note on Exact Tests of Hardy-Weinberg Equilibrium", Wigginton et al, Am. J. Hum. Genet (2005) 76:887-93.
Contact. For more information or questions, please contact the oxstatgen mailing list at
oxstatgen (at) jiscmail.ac.uk
QCTOOL was designed for use in a pipeline for QC of genotype data, the typical general structure of which is shown on the right. A detailed list of options is given by the command
$ qctool -help
which produces this output.
QCTOOL works with the following per-sample summary statistics, calculated using the -sample-stats option:
- Missing data proportion
- The total proportion of missing genotype data for this sample across all SNPs. It is computed as one minus the quantity (sum of the three genotype probabilities for the sample across all SNPs) / (total number of SNPs). A large missing data proportion might be due, for example, to a badly-prepared sample. You can filter on missingness using the -sample-missing-rate option.
- Heterozygosity
- This is the sum of heterozygote call probabilities across all SNPs divided by the total number of SNPs. A high value of heterozygosity might indicate, for example, that the DNA from this sample has been accidentally mixed with another sample during processing; a low value might indicate a higher degree of relatedness than expected among the ancestors of the individual. You can filter on heterozygosity using the -heterozygosity option.
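As a rough illustration of the two per-sample statistics, the following Python sketch computes them from a small set of genotype call probabilities. The data layout (one (P(AA), P(AB), P(BB)) triple per SNP and sample) and the function name are assumptions made for this example, not QCTOOL's internals. Missingness is read here as the probability mass not assigned to any genotype.

```python
def sample_stats(probs):
    """probs[snp][sample] is a (P(AA), P(AB), P(BB)) triple.

    Returns one (missing_proportion, heterozygosity) pair per sample.
    """
    n_snps = len(probs)
    n_samples = len(probs[0])
    stats = []
    for s in range(n_samples):
        # Total probability mass assigned to some genotype, over all SNPs.
        called = sum(sum(probs[v][s]) for v in range(n_snps))
        # Sum of heterozygote call probabilities, over all SNPs.
        het = sum(probs[v][s][1] for v in range(n_snps))
        stats.append((1.0 - called / n_snps, het / n_snps))
    return stats


# Two samples typed at four SNPs; (0, 0, 0) marks a missing call.
example = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)],
]
for missing, het in sample_stats(example):
    print(f"missing={missing:.2f} heterozygosity={het:.2f}")
```

In this toy data the first sample has no missing data and a heterozygosity of 0.50, while the second has one missing call out of four (0.25) and a heterozygosity of 0.75.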
QCTOOL works with the following per-SNP summary statistics, computed using the -snp-stats option:
- Missing data proportion
- The proportion of missing genotype data (null genotype call probabilities) across all samples for the SNP. A high value indicates that the SNP is not well called. You can filter these SNPs out using the -snp-missing-rate option.
- Missing call proportion
- The proportion of individuals for which the maximum genotype probability is less than a threshold of 0.9. You can filter these SNPs out using the -snp-missing-call-rate option.
- Minor allele frequency
- The estimated frequency of the less common allele. The -maf option can be used to retain only SNPs within a given range of minor allele frequencies.
- HWE
- This is a -log10 P-value for Hardy-Weinberg equilibrium, computed using the SNP-HWE code by Wigginton et al. The -hwe option can be used to filter out SNPs that are out of equilibrium.
- Info
- This is IMPUTE's info measure, and is defined as one minus the average over individuals of the variance of a genotype (given the individual's genotype call probability distribution), divided by the variance if only the allele frequency were known. It measures how much uncertainty there is in the genotype calls. It ranges from 0, when the genotype call probabilities are obtained from the allele frequency alone, to 1, when the calls are all certain. Use the -info option to filter on this statistic.
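The per-SNP statistics can be sketched in Python along the same lines. This is a minimal illustration under stated assumptions, not QCTOOL's implementation: each SNP is a list of (P(AA), P(AB), P(BB)) triples, one per sample; the function names are hypothetical; the allele frequency is estimated from expected allele counts over the called probability mass; and the info function assumes every triple sums to 1 and that the SNP is polymorphic. The HWE statistic is omitted.

```python
def missing_proportion(row):
    """Average probability mass not assigned to any genotype."""
    return 1.0 - sum(sum(p) for p in row) / len(row)


def missing_call_proportion(row, threshold=0.9):
    """Fraction of samples whose largest genotype probability is below threshold."""
    return sum(1 for p in row if max(p) < threshold) / len(row)


def minor_allele_frequency(row):
    """Frequency of the less common allele, from expected allele counts."""
    b_count = sum(p[1] + 2.0 * p[2] for p in row)
    total = 2.0 * sum(sum(p) for p in row)
    freq_b = b_count / total
    return min(freq_b, 1.0 - freq_b)


def info(row):
    """One minus (mean genotype variance) / (variance given allele frequency only)."""
    n = len(row)
    # Expected B-allele dosage and its variance for each sample.
    means = [p[1] + 2.0 * p[2] for p in row]
    variances = [p[1] + 4.0 * p[2] - m * m for p, m in zip(row, means)]
    theta = sum(means) / (2.0 * n)
    return 1.0 - (sum(variances) / n) / (2.0 * theta * (1.0 - theta))


# One SNP, four samples; the last call is missing.
snp = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 0.5, 0.0), (0.0, 0.0, 0.0)]
print(missing_proportion(snp), missing_call_proportion(snp), minor_allele_frequency(snp))

# Certain calls versus calls that only reflect an allele frequency of 0.5.
certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
from_frequency = [(0.25, 0.5, 0.25)] * 4
print(info(certain), info(from_frequency))
```

The last two calls show the two ends of the info scale described above: fully certain calls give 1, and calls that carry no more information than the allele frequency give 0.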
In general, qctool tries to warn you if it thinks you are doing something wrong. In these cases you can override qctool using the -force option.
For more information, see the usage examples.
format | recognised extension(s) | notes |
---|---|---|
GEN format | .gen, .gen.gz | Optionally, an extra initial column containing chromosomes can be included in the input. QCTOOL auto-detects this by counting the columns in the file. To suppress this column in output files, use the -omit-chromosome option. |
BGEN format | .bgen | Output files are in BGEN v1.1. QCTOOL can still read v1.0 of the BGEN spec. |
Variant call format | .vcf, .vcf.gz | QCTOOL is strict about metadata in input files for the fields it reads; since this is not always correct, a -metadata option is provided to override the input file metadata. Currently, only genotypes are output when outputting VCF files. Note: currently, QCTOOL does not apply PHRED scaling to probabilities in the GP field (this is in violation of the spec). |
IMPUTE haplotype format | (none) | Input only. To specify this filetype, use -g <haplotypes file> -filetype impute_haplotypes. It is assumed the legend file name is the same as the haplotypes file name, minus extension, with .legend appended. Genotypes are formed from pairs of haplotypes; it is assumed that the two haplotypes for each individual are consecutive columns in the haplotypes file. |
SHAPEIT haplotype format | (none) | Input only. Use -filetype shapeit_haplotypes to specify this file type. Genotypes are formed from pairs of haplotypes; it is assumed that the two haplotypes for each individual are consecutive columns in the file. (This format is described here.) |
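The column counting mentioned for GEN files in the table above can be sketched as follows. A GEN line has five leading columns (SNPID, rsid, position, alleleA, alleleB) followed by three probabilities per sample, so the column count modulo 3 distinguishes files with and without the extra chromosome column. The function name and the example lines are hypothetical, and this is an assumption about how such detection could work rather than a description of QCTOOL's code.

```python
def has_chromosome_column(gen_line):
    """Guess whether a GEN line starts with an extra chromosome column."""
    n_columns = len(gen_line.split())
    if n_columns < 5:
        raise ValueError(f"too few columns: {n_columns}")
    remainder = n_columns % 3
    if remainder == 2:
        return False  # 5 leading columns + 3 per sample
    if remainder == 0:
        return True   # 6 leading columns + 3 per sample
    raise ValueError(f"unexpected column count: {n_columns}")


# Two samples per line, so 6 probability columns.
plain = "SNP1 rs1 1000 A G 1 0 0 0 1 0"
with_chrom = "01 SNP1 rs1 1000 A G 1 0 0 0 1 0"
print(has_chromosome_column(plain), has_chromosome_column(with_chrom))
```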
QCTOOL is available either as binaries or as source code.
Binaries
Pre-compiled binaries are available for the following platforms.
Version | Platform | File |
---|---|---|
v1.4 | Linux x86-64 static build | qctool_v1.4-linux-x86_64.tgz (942Kb) |
v1.4 | Linux alternative build | qctool_v1.4-scientific-linux-x86_64.tgz (972Kb) |
v1.4 | Mac OS X | qctool_v1.4-osx.tgz (1.1Mb) |
v1.3 | Linux x86-64 static build | qctool_v1.3-linux-x86_64.tgz (1.7Mb) |
v1.3 | Linux alternative build | qctool_v1.3-scientific-linux-x86_64.tgz (983Kb) |
v1.3 | Mac OS X | qctool_v1.3-osx.tgz (1021Kb) |
v1.2* | Linux x86-64 static build | qctool_v1.2-linux-x86_64.tgz (1.7Mb) |
v1.2* | Mac OS X | qctool_v1.2-osx.tgz (1021Kb) |
v1.1* | Linux x86-64 static build | qctool_v1.1-linux-x86_64.tgz (1.6Mb) |
v1.1* | Mac OS X | qctool_v1.1-osx.tgz (1014Kb) |
v1.0* | Linux x86-64 static build | qctool_v1.0-static-linux-x86-64.bz2 (1001Kb) |
v1.0* | Mac OS X 10.6.3 | qctool_v1.0-static-osx-10.6.3.bz2 (431Kb) |
*Older versions are preserved for download here but are unsupported.
To run qctool, download the relevant file and extract it as follows.
$ tar -xzf qctool_v1.x-[machine].tgz
$ cd qctool_v1.x-[machine]
$ ./qctool -help
Source
The source code to qctool is available as a mercurial repository hosted on bitbucket. Assuming you have mercurial installed, a basic download and compilation sequence (for the currently released version) would be:
$ hg clone --rev qctool-release https://gavinband@bitbucket.org/gavinband/qctool
destination directory: qctool
requesting all changes
adding changesets
adding manifests
adding file changes
added 975 changesets with 5890 changes to 2022 files
updating to branch qctool-release
1821 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ cd qctool
$ ./waf-1.5.18 configure
$ ./waf-1.5.18
./build/release/qctool-release
You will need boost and zlib installed. More detailed build instructions can be found on the QCTOOL wiki.