qctool v2
A tool for quality control and analysis of GWAS datasets.

Genotype file formats

QCTOOL supports the following file formats for genotype data:

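Each entry below gives, in order: the format (with recognised extensions in parentheses), whether it is supported for input, output, or both (I/O), the filetype name (for -[o]filetype), and notes.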
(.gen, .gen.gz)
I/O: both. Filetype: gen.
Input files may optionally include an extra initial column containing chromosomes. QCTOOL auto-detects this column by counting the columns in the file. To suppress this column in output files, use the -omit-chromosome option.
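The column-counting rule can be illustrated with a short Python sketch. It assumes the usual GEN layout of five identifying columns (SNP ID, rsid, position and two alleles) followed by three probabilities per sample, so that an extra chromosome column shifts the column count by one. The function name has_chromosome_column is hypothetical, and this is not QCTOOL's own implementation.

```python
def has_chromosome_column(line: str) -> bool:
    """Guess whether a GEN line starts with an extra chromosome column.

    Assumes five identifying columns (SNPID, rsid, position, alleleA,
    alleleB) followed by three genotype probabilities per sample.
    """
    n_columns = len(line.split())
    remainder = (n_columns - 5) % 3
    if n_columns >= 5 and remainder == 0:
        return False  # 5 + 3N columns: no chromosome column
    if n_columns >= 6 and remainder == 1:
        return True  # 6 + 3N columns: leading chromosome column
    raise ValueError(f"unexpected number of columns: {n_columns}")


# Two samples, without and with a leading chromosome column.
plain = "snp1 rs1 1000 A G 1 0 0 0 1 0"
with_chrom = "01 snp1 rs1 1000 A G 1 0 0 0 1 0"
print(has_chromosome_column(plain))       # False
print(has_chromosome_column(with_chrom))  # True
```

A line whose column count fits neither pattern raises an error rather than guessing.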
(.bgen)
I/O: both. Filetype: bgen.

By default, output files are written in BGEN v1.2 format with 16 bits per probability and zlib compression. Use the -bgen-bits option to change the number of bits per probability, and the -bgen-compression option to choose the compression method (either zlib or zstd).

Use -ofiletype bgen_v1.1 to force writing files compatible with the BGEN v1.1 spec.
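To give a feel for what the -bgen-bits option trades off, the following Python sketch rounds a probability to the nearest value representable with a given number of bits. It assumes that a B-bit probability is stored as an integer between 0 and 2^B - 1, with B between 1 and 32, as in the BGEN v1.2 layout. The function name quantise is hypothetical, and a real encoder also keeps the stored values for each sample summing to one, which this sketch ignores.

```python
def quantise(probability: float, bits: int) -> float:
    """Round a probability to the nearest multiple of 1 / (2**bits - 1)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if not 1 <= bits <= 32:
        raise ValueError("bits must be between 1 and 32")
    max_value = 2**bits - 1
    stored = round(probability * max_value)  # integer that would be stored
    return stored / max_value


# Compare the rounding error at 8 and 16 bits per probability.
for bits in (8, 16):
    approx = quantise(0.3, bits)
    print(bits, approx, abs(approx - 0.3))
```

The rounding error is at most half of one step, that is 0.5 / (2^B - 1), so fewer bits give smaller files at the cost of coarser probabilities.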

(.vcf, .vcf.gz)
I/O: both. Filetype: vcf.

QCTOOL is strict about VCF metadata in input files for the fields it reads. Since metadata is not always correct, the -metadata option can be used to override the input file metadata. Currently, only genotypes are written when outputting VCF files.

Note that QCTOOL does not apply PHRED scaling to probabilities in the GP field.

(.bed, .bim, .fam)
I/O: both. Filetype: binary_ped.
QCTOOL currently does only the most basic processing of FAM files. When reading, it uses the FAM file to count the number of samples in the BED file; when writing, it writes a FAM file with missing data in all fields except the ID field. You will therefore need to create fuller FAM files separately for use with other tools.
SHAPEIT haplotype format
I/O: both. Filetype: shapeit_haplotypes.
For genotypic computations, genotypes are formed from pairs of haplotypes; it is assumed that the two haplotypes for each individual are consecutive columns in the file. (This format is described here.)
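As an illustration of how genotypes are formed from consecutive haplotype columns, here is a minimal Python sketch. It assumes that each haplotype is coded 0 or 1 (first or second allele) and that the input list holds only the haplotype columns of one variant. The function name genotypes_from_haplotypes is hypothetical.

```python
from typing import List


def genotypes_from_haplotypes(haplotypes: List[int]) -> List[int]:
    """Sum consecutive haplotype pairs into second-allele counts (0, 1 or 2)."""
    if len(haplotypes) % 2 != 0:
        raise ValueError("expected two haplotype columns per individual")
    if any(h not in (0, 1) for h in haplotypes):
        raise ValueError("haplotypes must be coded 0 or 1")
    # Columns 2i and 2i + 1 belong to individual i.
    return [haplotypes[i] + haplotypes[i + 1] for i in range(0, len(haplotypes), 2)]


# Three individuals with haplotype pairs (0, 0), (0, 1) and (1, 1).
print(genotypes_from_haplotypes([0, 0, 0, 1, 1, 1]))  # [0, 1, 2]
```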
IMPUTE allele probabilities format
I/O: both. Filetype: impute_allele_probs.
Use -[o]filetype impute_allele_probs to specify reading or writing this filetype. This file format is like the SHAPEIT haplotype format but contains a probability for each haplotype (i.e. two probabilities per individual), giving the probability that the haplotype carries the second allele.
IMPUTE haplotype format
I/O: input only. Filetype: impute_haplotypes.
QCTOOL assumes that the legend file name is the same as the haplotypes file name, minus its extension, with .legend appended; it will also remove or add the .gz extension as appropriate. For genotypic computations, genotypes are formed from pairs of haplotypes; it is assumed that the two haplotypes for each individual are consecutive columns in the haplotypes file.
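The legend file naming rule can be sketched in Python as follows. This is one possible reading of the rule, not QCTOOL's actual logic: it assumes that a trailing .gz is set aside, the last remaining extension is dropped, .legend is appended, and .gz is restored if it was present. The function name legend_filename and the example file names are hypothetical.

```python
import os


def legend_filename(haplotypes_filename: str) -> str:
    """Derive a legend file name from a haplotypes file name.

    Assumed interpretation: set aside a trailing .gz, drop the last
    remaining extension, append .legend, then restore .gz if present.
    """
    gz = ".gz"
    gzipped = haplotypes_filename.endswith(gz)
    name = haplotypes_filename[: -len(gz)] if gzipped else haplotypes_filename
    root, _extension = os.path.splitext(name)  # drops e.g. ".haps"
    return root + ".legend" + (gz if gzipped else "")


print(legend_filename("chr1.haps"))     # chr1.legend
print(legend_filename("chr1.haps.gz"))  # chr1.legend.gz
```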
HLAIMP probability format
I/O: input only. Filetype: hlaimp.
Currently, this input format implicitly splits each HLA locus into a series of bi-allelic variants.
QCTOOL 'long' format
I/O: input only. Filetype: long.
Input must be a file with columns SNPID, rsid, chromosome, position, number_of_alleles, allele1, other_alleles, sample_id, ploidy, genotype. Further columns may also be included, but QCTOOL ignores them. Alleles in the other_alleles column must be comma-separated (as with VCF ALT alleles). When writing VCF output, both the genotype (GT) and a field 'typed', indicating whether a row was present for each sample and variant, are output.
I/O: output only. Filetype: penncnv.
PennCNV uses a single sample per input file; this can be achieved using the sample filtering options, e.g. -incl-samples-where ID_1=<identifier>
BIMBAM dosage format;
QCTOOL dosage format
(.dosage[.gz])
I/O: output only. Filetype: bimbam_dosage or dosage.
The output file contains a single column per sample (named by the sample identifier) holding the expected second allele dosage for the sample at each variant. The two formats differ in that the BIMBAM format has no chromosome/position information.
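The expected second allele dosage can be written out explicitly. The Python sketch below assumes a diploid sample at a bi-allelic variant, with genotype probabilities given in the order (AA, AB, BB), where B is the second allele. The function name expected_second_allele_dosage is hypothetical.

```python
from typing import Sequence


def expected_second_allele_dosage(probabilities: Sequence[float]) -> float:
    """Return the expected number of second alleles, P(AB) + 2 * P(BB)."""
    if len(probabilities) != 3:
        raise ValueError("expected three genotype probabilities (AA, AB, BB)")
    p_aa, p_ab, p_bb = probabilities
    # Each genotype contributes its second-allele count times its probability.
    return 0.0 * p_aa + 1.0 * p_ab + 2.0 * p_bb


print(expected_second_allele_dosage([0.0, 1.0, 0.0]))    # 1.0
print(expected_second_allele_dosage([0.25, 0.5, 0.25]))  # 1.0
```

The dosage therefore ranges from 0 (certainly homozygous for the first allele) to 2 (certainly homozygous for the second allele).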
QCTOOL intensity text format
(.intensity[.gz])
I/O: output only. Filetype: intensity.
The output file has two columns per sample, representing the X and Y channel intensities for the sample at each variant. Currently, the data must be read from a VCF file; the field is specified using the -vcf-intensity-field option.