qctool v2
A tool for quality control and analysis of GWAS datasets.

Altering variant identifying data

Filling in missing chromosome information
The -assume-chromosome option fills in any missing chromosome information in input data with the specified value. E.g.:
$ qctool -g <input file> -og output.bgen -assume-chromosome <chromosome>
Updating identifying data
The -map-id-data option can be used to update the identifying data for each variant with a new set of data. E.g.:
$ qctool -g <input file(s)> -og output.bgen -map-id-data <map file> [+other options]

For example, this might be useful when updating files to match a new genome build.

The "map" file given to -map-id-data must be a text file with twelve named columns, in the following order: the current SNPID, rsid, chromosome, position, first and second alleles, followed by the desired updated SNPID, rsid, chromosome, position and alleles. The first line is treated as column names (currently it doesn't matter what these are called). Variants not in this file are not affected by the mapping and are output unchanged.
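
As an illustration of the twelve-column layout described above, the following shell sketch writes a small map file and checks that every line has twelve fields. The file name example_map.txt, the column names and all variant values are made up for illustration. The sketch also assumes whitespace-separated columns, which is an assumption rather than something stated on this page.

```shell
# Hypothetical example: a map file for two variants (all values are made up).
# Assumption: columns are whitespace-separated; the first line holds column names.
cat > example_map.txt <<'EOF'
old_SNPID old_rsid old_chromosome old_position old_alleleA old_alleleB new_SNPID new_rsid new_chromosome new_position new_alleleA new_alleleB
snp1 rs0001 1 1000 A G snp1 rs0001 1 1200 A G
snp2 rs0002 1 2000 C T snp2_new rs0002 1 2350 C T
EOF

# Sanity check: collect the numbers of any lines that do not have twelve fields.
bad_lines=$(awk 'NF != 12 { print NR }' example_map.txt)
if [ -z "$bad_lines" ]; then
  echo "all lines have 12 columns"
else
  echo "lines with wrong column count: $bad_lines"
fi
```

A file like this would then be passed as the <map file> argument to -map-id-data.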

Matching of variants to the map file is controlled by the -compare-variants-by option - see the page on sorting data for more on this option.

Aligning alleles

The -strand option can be used to update alleles and flip genotype data according to strand information supplied in an external file. The general format is:

$ qctool -g <input file(s)> -og output.bgen -strand <strand file> [-flip-to-match-allele <column name>]

The most common use of this option is to align alleles to match the forward strand of a reference sequence, and to flip genotypes so that the first allele is the reference allele.

Strand files should have seven columns, which must be named SNPID, rsid, chromosome, position, alleleA, alleleB and strand, plus any additional columns. Strand information is read from the strand column. Alleles at variants where the strand is '+' will be processed unchanged; alleles at variants where the strand is '-' will be complemented (i.e. A<->T, G<->C); variants which have missing strand information (encoded as "?" or "NA"), and variants that are missing from the file, will be omitted from the output.
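
The strand rules above can be sketched in shell as follows. This is only an illustration of the described behaviour, not qctool's implementation. The file names, the complement helper and the variant values are hypothetical; the sketch assumes whitespace-separated columns and single-base alleles.

```shell
# Complement each base of an allele (A<->T, G<->C).
complement() {
  printf '%s\n' "$1" | tr 'ACGT' 'TGCA'
}

# Hypothetical strand file with one variant per strand value (values are made up).
cat > example_strand.txt <<'EOF'
SNPID rsid chromosome position alleleA alleleB strand
snp1 rs0001 1 1000 A G +
snp2 rs0002 1 2000 A G -
snp3 rs0003 1 3000 C T ?
EOF

# Skip the header line, then apply the rule for each strand value.
tail -n +2 example_strand.txt | while read -r snpid rsid chrom pos a b strand; do
  case "$strand" in
    +) echo "$snpid $a $b" ;;                                  # unchanged
    -) echo "$snpid $(complement "$a") $(complement "$b")" ;;  # complemented
    *) ;;                                                      # "?" or "NA": omitted
  esac
done > example_aligned.txt

cat example_aligned.txt
```

Here snp1 keeps its alleles, snp2 has its alleles complemented, and snp3 is dropped because its strand is missing.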

The -compare-variants-by option controls how variants are matched between the genotype data and the strand file. See the page on sorting for more information.

If the -flip-to-match-allele option is given, the strand file must contain a column with the specified name. Each value in this column should be one of the two alleles of the variant. Alleles and genotypes are then also recoded so that the allele in the specified column is the first allele and the other allele is the second allele. Note that the strand alignment is applied first: e.g. if the variant alleles are 'A' and 'G' and the strand is '-', the -flip-to-match-allele column should contain 'T' or 'C'.
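
The order of operations described above (strand alignment first, then recoding to the requested first allele) can be sketched in shell. The align_and_flip and complement helpers are hypothetical names introduced for this illustration and are not part of qctool. The sketch assumes single-base alleles, and its handling of a target allele that matches neither allele is the sketch's own choice, since that case is not described here.

```shell
# Complement each base of an allele (A<->T, G<->C).
complement() {
  printf '%s' "$1" | tr 'ACGT' 'TGCA'
}

# Hypothetical helper: align_and_flip <alleleA> <alleleB> <strand> <target allele>
align_and_flip() {
  a=$1; b=$2; strand=$3; target=$4
  # Step 1: strand alignment is applied first.
  if [ "$strand" = "-" ]; then
    a=$(complement "$a")
    b=$(complement "$b")
  fi
  # Step 2: recode so that the target allele becomes the first allele.
  if [ "$target" = "$a" ]; then
    echo "$a $b"
  elif [ "$target" = "$b" ]; then
    echo "$b $a"
  else
    # Not described on this page; the sketch simply reports the mismatch.
    echo "target allele $target matches neither $a nor $b" >&2
    return 1
  fi
}

align_and_flip A G - C   # prints "C T"
align_and_flip A G + G   # prints "G A"
```

In the first call the alleles A and G are complemented to T and C, and C is then moved to the first position, which matches the example in the text.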