QCTOOL v2

Combining datasets

Combining several datasets into a larger one

QCTOOL's -g and -s options can be specified several times, with the effect of combining data into one larger dataset. E.g.:

$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample

QCTOOL will produce both a combined genotype file and a combined sample file. The output files will have N₁+N₂ samples, where N₁ and N₂ are the numbers of samples in the two input datasets, and the genotype file will contain all variants that can be matched between the input datasets.

QCTOOL attempts to merge the columns of the sample files based on column name and type. Columns are dropped if they have the same name but different types. Otherwise, each column in the input sample files appears in the output sample file, possibly with missing values for those cohorts for which that column is not present.
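The column-merging rule above can be pictured with a small Python sketch. This is an illustration of the rule as described, not QCTOOL's implementation: the function name `merge_sample_columns`, the in-memory representation of a sample file as a (column types, rows) pair, and the use of `NA` as the missing-value marker are all assumptions made for the example.

```python
def merge_sample_columns(cohorts):
    """Merge sample-file columns by name and type (illustrative only).

    `cohorts` is a list of (column_types, rows) pairs, where column_types
    maps column name -> type code and rows is a list of dicts.
    This representation is hypothetical, chosen for the sketch.
    """
    types = {}          # first type seen for each column name
    conflicted = set()  # names seen with more than one type
    for column_types, _rows in cohorts:
        for name, type_code in column_types.items():
            if name in types and types[name] != type_code:
                conflicted.add(name)
            types.setdefault(name, type_code)

    # Columns with the same name but different types are dropped.
    kept = [name for name in types if name not in conflicted]

    # Every remaining column appears in the output; cohorts that lack
    # a column get a missing value ("NA" is an assumption here).
    merged_rows = []
    for _column_types, rows in cohorts:
        for row in rows:
            merged_rows.append({name: row.get(name, "NA") for name in kept})
    return {name: types[name] for name in kept}, merged_rows
```

For instance, if `age` is typed differently in the two cohorts and `bmi` is present only in the second, this sketch drops `age` and fills `bmi` with `NA` for every sample from the first cohort.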

QCTOOL attempts to combine data at each variant that is present in all input datasets. To do this, it makes the assumption that data is sorted uniquely and in the same way in all input datasets. If this is not the case then QCTOOL may be unable to match some variants. Any variant that does not match between datasets will be omitted from the output.

QCTOOL processes variant identifiers in the following way: it uses the primary identifier from the first dataset as the overall primary identifier. It also keeps a list of all other identifiers observed across the datasets. (When the input file is a VCF, the first identifier is treated as the primary identifier. When the input is in GEN or BGEN format, the `rsid` column is treated as containing the primary identifier, and alternative identifiers from the `SNPID` column are also processed.) The `-map-id-data` option can also be used to force specific adjustments to these data.
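As a rough illustration of this identifier handling, the following Python sketch combines the identifiers of one matched variant across datasets. It is a reading of the description above rather than QCTOOL's actual code: the function name `merge_identifiers` is hypothetical, and the de-duplication and ordering of the alternative identifiers are assumptions.

```python
def merge_identifiers(per_dataset_ids):
    """Combine identifiers for one matched variant (illustrative only).

    `per_dataset_ids` lists, in dataset order, a (primary, alternatives)
    pair for the same variant in each input dataset.
    """
    # The primary identifier from the first dataset becomes the overall one.
    primary = per_dataset_ids[0][0]

    # All other identifiers seen in any dataset are kept as alternatives.
    # Removing duplicates and keeping first-seen order is an assumption.
    others = []
    for dataset_primary, alternatives in per_dataset_ids:
        for identifier in [dataset_primary, *alternatives]:
            if identifier != primary and identifier not in others:
                others.append(identifier)
    return primary, others
```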

Controlling how variants are matched

The -compare-variants-by option controls which fields QCTOOL compares when matching variants. (The default behaviour is to match variants by genomic position, alleles, and ID fields.) For example, in the command:

$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample -compare-variants-by position,alleles

QCTOOL will match variants by position and alleles (variants will match even if ID fields differ). As in the example, position (the genomic position) must always be the first field matched on.
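To see why both the choice of fields and the sort order matter, here is a minimal Python sketch of matching two sorted variant lists on a chosen set of fields. It is a conceptual model under stated assumptions, not a description of QCTOOL's internals: the functions `make_key` and `match_sorted` and the dictionary keys (including `ids`) are hypothetical, and each list is assumed to be sorted by the fields being compared.

```python
def make_key(variant, fields):
    """Build the comparison key for a variant from the chosen fields."""
    if not fields or fields[0] != "position":
        raise ValueError("position must be the first field compared")
    return tuple(variant[field] for field in fields)


def match_sorted(first, second, fields):
    """Pair up variants from two lists that are sorted by `fields`.

    A single forward pass (merge join): variants found in only one
    list are skipped, so they would be absent from the output.
    """
    matched = []
    i = j = 0
    while i < len(first) and j < len(second):
        key_a = make_key(first[i], fields)
        key_b = make_key(second[j], fields)
        if key_a == key_b:
            matched.append((first[i], second[j]))
            i += 1
            j += 1
        elif key_a < key_b:
            i += 1  # variant only in the first list
        else:
            j += 1  # variant only in the second list
    return matched
```

In this model, two variants at the same position with the same alleles but different identifiers pair up when `fields` is `["position", "alleles"]` and fail to pair when the identifier field is also compared. Because the pass only moves forward, a list that is not sorted consistently with the other can cause variants to be skipped even though they are present in both.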

Note: careful use of these options is especially important when datasets contain multiple variants sharing the same genomic position. The recommendation is to ensure all datasets are encoded uniformly before combining them. (See the pages on sorting, aligning alleles and altering ID data for options that can help with this.)

Allowing for allele mismatches

It is sometimes the case that data is sorted in each dataset but that the alleles of some variants are coded in a different order. The -match-alleles-to-cohort1 option tells QCTOOL to attempt to match data allowing for this type of mismatch:

$ qctool -g example.bgen -s example.sample -g second_cohort_#.gen -s second_cohort.sample -og joined.gen -os joined.sample -compare-variants-by position,alleles -match-alleles-to-cohort1

Here, if the variant is biallelic and the alleles in the second dataset are the same as those in the first dataset, but coded the other way round (e.g. cohort1 = A/G, cohort2 = G/A), then the alleles and genotypes are flipped accordingly. No other transformation (e.g. strand flips, or matching multiallelics) is performed by this operation. Thus, although this option can be convenient, the general recommendation is to arrange for each dataset to be encoded and sorted uniformly before combining.
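The flip described above can be sketched in Python as follows. This is only an illustration of the rule as stated, not QCTOOL's code: the function name `match_alleles_to_first` is hypothetical, and genotypes are assumed to be stored per sample as a triple of probabilities (first-allele homozygote, heterozygote, second-allele homozygote).

```python
def match_alleles_to_first(first_alleles, second_alleles, second_probs):
    """Recode the second dataset's variant to the first dataset's alleles.

    Returns (alleles, probabilities) on success, or None when the
    alleles cannot be reconciled by a simple swap.
    """
    first_alleles = tuple(first_alleles)
    second_alleles = tuple(second_alleles)

    if second_alleles == first_alleles:
        # Already coded the same way; nothing to change.
        return first_alleles, list(second_probs)

    if len(first_alleles) == 2 and second_alleles == first_alleles[::-1]:
        # Biallelic and coded the other way round (e.g. A/G vs G/A):
        # swap the two homozygote probabilities for every sample.
        flipped = [tuple(reversed(probs)) for probs in second_probs]
        return first_alleles, flipped

    # No strand flips or multiallelic matching are attempted.
    return None
```

With cohort1 = A/G and cohort2 = G/A, a sample's probabilities (0.9, 0.1, 0.0) in the second dataset become (0.0, 0.1, 0.9); a strand-flipped pair such as T/C is left unmatched in this sketch.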