Altering variant identifying data
The -assume-chromosome option fills in any missing chromosome information in the input data
with the specified value.
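As an illustration of the behaviour described above, here is a minimal Python sketch of filling in missing chromosome values. It is not the tool's implementation: the function name, the dictionary-based variant records, and the set of values treated as "missing" are all assumptions made for this example.

```python
def assume_chromosome(variants, value, missing=(None, "", "NA")):
    """Return copies of the variant records with missing chromosomes filled in.

    `variants` is a list of dicts with a "chromosome" key (hypothetical
    representation); which values count as missing is an assumption.
    """
    return [
        {**v, "chromosome": value} if v.get("chromosome") in missing else dict(v)
        for v in variants
    ]


variants = [
    {"rsid": "rs1", "chromosome": None, "position": 1000},
    {"rsid": "rs2", "chromosome": "02", "position": 2000},
]
filled = assume_chromosome(variants, "01")
```

Only the record without a chromosome is changed; records that already carry a chromosome are passed through unchanged.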
The -map-id-data option can be used to update the identifying data for each variant
with a new set of data.
For example, this might be useful when updating files to match a new genome build.
The "map" file given to -map-id-data must be a text file
with twelve named columns, in the following order: the current SNPID, rsid, chromosome, position, first and second alleles,
followed by the desired updated SNPID, rsid, chromosome, position and alleles. The first line is treated as column names
(currently it doesn't matter what these are called). Variants not in this file are not affected by the mapping, and will
be output unchanged.
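The twelve-column layout described above can be sketched in Python as follows. This is an illustrative helper, not part of the tool: the function name, the column names (which the text says are arbitrary), the example variant, and the choice of a space as the field separator are all assumptions.

```python
import csv
import io

# Column names are arbitrary according to the text; these are illustrative only.
HEADER = [
    "old_SNPID", "old_rsid", "old_chromosome", "old_position", "old_alleleA", "old_alleleB",
    "new_SNPID", "new_rsid", "new_chromosome", "new_position", "new_alleleA", "new_alleleB",
]


def build_map_file(rows, delimiter=" "):
    """Render (current, updated) pairs of six-field variant records as map-file text.

    The space delimiter is an assumption; check which separator your setup expects.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(HEADER)
    for current, updated in rows:
        if len(current) != 6 or len(updated) != 6:
            raise ValueError("each variant needs six current and six updated fields")
        writer.writerow(list(current) + list(updated))
    return buffer.getvalue()


# Hypothetical variant whose position changes between genome builds.
example = build_map_file([
    (("SNP1", "rs1", "01", 1000, "A", "G"), ("SNP1", "rs1", "01", 1200, "A", "G")),
])
```

Each data row carries the six current fields followed by the six updated fields, in the order given above.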
Matching of variants to the map file is controlled by the -compare-variants-by
option - see the page on sorting data for more on this option.
The -strand option can be used to update alleles and flip genotype data according
to strand information supplied in an external file.
The most common use of this option is to align alleles to match the forward strand of a reference sequence, and to flip genotypes so that the first allele is the reference allele.
Strand files should have seven columns which must be named as follows:
SNPID, rsid, chromosome, position, alleleA, alleleB, strand, plus any additional columns.
Strand information is read from the strand column.
Alleles at variants where the strand is '+' will be processed unchanged; alleles
at variants where the strand is '-' will be complemented (i.e. A<->T, G<->C);
variants which have missing strand information (encoded as "?" or "NA", or because the variant
is missing from the file) will be omitted from the output.
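The three cases above ('+', '-', and missing) can be sketched in Python as follows. This is a simplified illustration rather than the tool's code: the function name is hypothetical, `None` stands in for a variant that is absent from the strand file, and only single-base alleles are handled.

```python
# Base complements, as listed in the text (A<->T, G<->C).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Encodings of missing strand information named in the text.
MISSING = {"?", "NA"}


def apply_strand(alleles, strand):
    """Return the strand-aligned allele pair, or None if the variant is omitted.

    `strand` is None when the variant does not appear in the strand file
    (an assumption of this sketch). Only single-base alleles are supported.
    """
    if strand is None or strand in MISSING:
        return None
    if strand == "+":
        return tuple(alleles)
    if strand == "-":
        return tuple(COMPLEMENT[a] for a in alleles)
    raise ValueError(f"unrecognised strand value: {strand!r}")
```

For example, `apply_strand(("A", "G"), "-")` gives `("T", "C")`, while a strand of `"?"` yields `None`, marking the variant for omission.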
The -compare-variants-by option controls how variants are matched between the genotype data
and the strand file.
See the page on sorting for more information.
If the -flip-to-match-allele option is given, the strand file must contain a column
with the specified name. Each value in this column should be one of the two alleles of the variant.
Alleles and genotypes are then also recoded so that the allele in the specified
column is the first allele and the other allele is the second allele.
Note that the strand alignment is applied first - e.g. if the variant alleles are 'A' and 'G' and the
strand is '-', the -flip-to-match-allele column should contain 'T' or 'C'.
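The order of operations (strand alignment first, then recoding to match the given allele) can be sketched in Python as follows. This is a hedged illustration only: the function name is hypothetical, genotypes are assumed to be stored as counts of the second allele (0, 1, 2, or None for missing), and raising an error when the allele matches neither strand-aligned allele is a choice made for this sketch, not documented behaviour.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def align_and_flip(alleles, strand, match_allele, second_allele_counts):
    """Apply strand alignment, then recode so `match_allele` is the first allele.

    Returns (alleles, genotypes). Genotypes are counts of the second allele
    per sample (an assumed encoding); None marks a missing genotype.
    """
    first, second = alleles
    # Step 1: strand alignment is applied before any flipping.
    if strand == "-":
        first, second = COMPLEMENT[first], COMPLEMENT[second]
    elif strand != "+":
        raise ValueError(f"unsupported strand value: {strand!r}")
    # Step 2: compare the match allele against the strand-aligned alleles.
    if match_allele == first:
        return (first, second), list(second_allele_counts)
    if match_allele == second:
        # Swapping the alleles turns a count g of the old second allele into 2 - g.
        flipped = [None if g is None else 2 - g for g in second_allele_counts]
        return (second, first), flipped
    raise ValueError(f"{match_allele!r} is not one of the aligned alleles")
```

With alleles 'A' and 'G' on the '-' strand, the aligned alleles are 'T' and 'C', so a match allele of 'C' swaps the pair and flips the genotype counts, whereas 'A' or 'G' would be rejected.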