QCTOOL v2

Annotating with sequence bases

The -annotate-sequence option can be used to extract sequence bases from FASTA file(s) and annotate the output file with them. E.g.:

$ qctool -g <input file> -osnp output.txt -annotate-sequence chr#.fa reference

Note: Currently, to be used with this option, FASTA files must be split into one file per chromosome; the chromosome is then inferred from the file name (using the chromosomal wildcard character #), as in the command above. It is also assumed that each FASTA file starts at base pair 1, so that bases can be looked up by position. I have used this to annotate alleles from the human reference sequence or an ancestral sequence.
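The lookup described in the note can be pictured with a short Python sketch. This is only an illustration of the stated conventions (the # wildcard and a sequence starting at base pair 1), not QCTOOL's implementation; the function names fasta_path_for, read_fasta_sequence and base_at are hypothetical, and the FASTA content is made up.

```python
import io


def fasta_path_for(pattern: str, chromosome: str) -> str:
    """Replace the '#' wildcard in a file name pattern with a chromosome name."""
    return pattern.replace("#", chromosome)


def read_fasta_sequence(handle) -> str:
    """Join the sequence lines of a single-record FASTA stream, skipping the header."""
    return "".join(line.strip() for line in handle if not line.startswith(">"))


def base_at(sequence: str, position: int) -> str:
    """Return the base at a 1-based position, assuming the sequence starts at base pair 1."""
    if not 1 <= position <= len(sequence):
        raise IndexError(f"position {position} is outside the sequence")
    return sequence[position - 1]


# Made-up FASTA content standing in for a per-chromosome file.
example = io.StringIO(">chr1\nACGT\nTTGA\n")
seq = read_fasta_sequence(example)

print(fasta_path_for("chr#.fa", "1"))
print(base_at(seq, 5))
```

Because positions are 1-based and Python strings are 0-based, the lookup subtracts one; a FASTA file that does not start at base pair 1 would give shifted bases under this scheme.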

Annotating with flanking sequence

When using -annotate-sequence, the -flanking option tells QCTOOL to additionally annotate output with flanking sequence from FASTA files. For example:

$ qctool -g <input file> -osnp output.txt -annotate-sequence chr#.fa reference -flanking 200 200

This will output the 200bp of FASTA sequence preceding each variant and the 200bp following it, along with the bases covered by the variant's alleles.
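As a rough sketch of what flanking extraction involves, the Python below slices a left flank, the covered bases, and a right flank from a sequence. It assumes the variant position is the 1-based coordinate of its first base and that the covered span has a known length; how QCTOOL itself determines the covered bases is not stated here, and the function name flanking_sequence is hypothetical.

```python
def flanking_sequence(sequence: str, position: int, covered_length: int,
                      left: int, right: int):
    """Return (left flank, covered bases, right flank) for a variant.

    position is the 1-based coordinate of the first covered base.
    Flanks are truncated at the ends of the sequence.
    """
    start = position - 1            # 0-based index of the first covered base
    end = start + covered_length    # 0-based index one past the last covered base
    left_flank = sequence[max(0, start - left):start]
    covered = sequence[start:end]
    right_flank = sequence[end:end + right]
    return left_flank, covered, right_flank


# Made-up sequence; a single-base variant at position 5 with 2bp flanks.
print(flanking_sequence("ACGTTTGACC", 5, 1, 2, 2))
```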

Annotating with genetic map information

The -annotate-genetic-map option can be used to output genetic (recombination) map coordinates for each variant, e.g.:

$ qctool -g <input file(s)> -annotate-genetic-map genetic_map_chr#.txt -osnp output.txt

The genetic map files should be in the 'hapmap' format, i.e. one file per chromosome with three columns specifying position, recombination rate in centimorgans per megabase, and the accumulated recombination map position. The chromosome is inferred from the filename, using the # wildcard as in the command above. Suitable genetic map files for human build 37 can be found on the IMPUTE2 website. The output will contain the columns cM_per_Mb and cM_from_start_of_chromosome.
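To make the three-column layout concrete, here is a small Python sketch that parses such a file and looks up a map position for a variant. The linear interpolation between map points is an assumption made for illustration; how QCTOOL treats variants that fall between map positions is not described here. The header text and numbers are made up, and the function names are hypothetical.

```python
import bisect


def parse_genetic_map(lines):
    """Parse lines of: position, rate (cM/Mb), accumulated map position (cM)."""
    positions, rates, cms = [], [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            pos, rate, cm = int(fields[0]), float(fields[1]), float(fields[2])
        except ValueError:
            continue  # skip a header line or any non-numeric row
        positions.append(pos)
        rates.append(rate)
        cms.append(cm)
    return positions, rates, cms


def interpolate_cm(positions, cms, position):
    """Linearly interpolate the accumulated map position (assumed scheme)."""
    if position <= positions[0]:
        return cms[0]
    if position >= positions[-1]:
        return cms[-1]
    i = bisect.bisect_right(positions, position)
    x0, x1 = positions[i - 1], positions[i]
    y0, y1 = cms[i - 1], cms[i]
    return y0 + (y1 - y0) * (position - x0) / (x1 - x0)


# Made-up map: 2 cM/Mb over 1000bp, then 1 cM/Mb over 2000bp.
example = [
    "position rate_cM_per_Mb map_cM",
    "1000 2.0 0.0",
    "2000 1.0 0.002",
    "4000 0.0 0.004",
]
positions, rates, cms = parse_genetic_map(example)
print(interpolate_cm(positions, cms, 1500))
```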

Interval annotations

The -annotate-bed3 and -annotate-bed4 options can be used to compute, at each input variant, membership of the intervals in a BED file (-annotate-bed3) or the value(s) assigned to those intervals (-annotate-bed4):

$ qctool -g <input file(s)> -annotate-bed3 file1.bed -annotate-bed4 file2.bed

Output will contain a column with the same name as the BED file (minus the .bed or .bed.gz extension). For -annotate-bed3, this column will contain a 1 if the variant was contained in an interval in the file, or 0 otherwise. For -annotate-bed4, the column will contain a comma-separated list of values from the fourth column of the BED file, for those intervals which the variant is in.

Note: BED files are assumed to contain intervals in 0-based, right-open coordinates, while QCTOOL by convention assumes genotype data is expressed in 1-based coordinates. QCTOOL handles this internally by adding 1 to the start coordinate of each interval.
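The coordinate conversion in the note can be illustrated with a few lines of Python. This is a sketch of the arithmetic only; the function names bed_to_one_based and contains are hypothetical, and the interval is made up.

```python
def bed_to_one_based(start: int, end: int):
    """Convert a 0-based, right-open BED interval [start, end)
    to a 1-based, closed interval [start + 1, end]."""
    return start + 1, end


def contains(interval, position: int) -> bool:
    """Test whether a 1-based position lies in a 1-based closed interval."""
    first, last = interval
    return first <= position <= last


# A BED line "chr1 100 200" covers 1-based positions 101 to 200 inclusive.
interval = bed_to_one_based(100, 200)
for position in (100, 101, 200, 201):
    print(position, int(contains(interval, position)))
```

Only the start coordinate changes: the right-open end of a 0-based interval already equals the last included position in 1-based coordinates.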

It's also possible to compute membership of intervals in a set of BED files in the same column. The general syntax is -annotate-bed[3|4] file1.bed[,file2.bed[,...]][+<N>bp]. This internally concatenates file1.bed, file2.bed, etc. into a single list of intervals. Further, if the +<N>bp modifier is added, where N is an integer, then all intervals are expanded by N bases to the left and the right before processing. For example:

$ qctool -g <input file(s)> -annotate-bed4 file1.bed,file2.bed+100bp

This command will annotate each variant with the values of all intervals, from either file, that lie within 100bp of it.
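The combined behaviour (concatenating files, expanding intervals, and listing fourth-column values) can be sketched in Python as follows. This only illustrates the semantics described above; the order of the listed values and the empty string for "no hit" are assumptions, the function name annotate_bed4 is hypothetical, and the intervals are made up.

```python
def annotate_bed4(intervals, chromosome: str, position: int, margin: int = 0) -> str:
    """Return a comma-separated list of values for intervals containing the variant.

    intervals is a list of (chromosome, bed_start, bed_end, value) tuples in
    0-based, right-open BED coordinates; position is 1-based. Each interval is
    expanded by margin bases on both sides before testing.
    """
    hits = []
    for chrom, start, end, value in intervals:
        first = start + 1 - margin   # 1-based first position after expansion
        last = end + margin          # 1-based last position after expansion
        if chrom == chromosome and first <= position <= last:
            hits.append(value)
    return ",".join(hits)


# Made-up contents of two BED files, concatenated into one list of intervals.
file1 = [("chr1", 100, 200, "geneA")]
file2 = [("chr1", 250, 300, "geneB")]
combined = file1 + file2

print(annotate_bed4(combined, "chr1", 230))              # no expansion
print(annotate_bed4(combined, "chr1", 230, margin=100))  # as with +100bp
```

In this example position 230 lies in neither interval as given, but falls within 100bp of both, so the expanded lookup reports both values.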

Computing annotations