ic is a set of programs that produces a single-page HTML visual summary of one or more imputed data sets from the most common imputation programs. The poster from the ASHG 2016 meeting, which describes this program and the pre-imputation checking, can be downloaded here (3.8MB).
The latest version (>v1.0.8) requires only the ic.pl program. Earlier versions required an additional program to do the initial parsing of the VCF file. The programs are available to download here:
No installation of the program is required: extract all the files in the zip file to a directory. There are, however, a number of dependencies; see Requirements below for how to install these. The instructions assume you have root access to the system you are working on.
The program requires the GD libraries to be installed to create the plots.
Installation will vary by system. On Ubuntu, libGD can be installed system-wide using the following commands:
Before 16.04:
sudo apt-get -y install libgd2-xpm-dev build-essential
16.04 onwards:
sudo apt-get -y install libgd-dev build-essential
Install cpanm and then GD::Graph:
sudo cpan App::cpanminus
sudo cpanm GD::Graph
A reference panel summary file is also required: either the 1000G phase 3 summary (from here, 1.44GB) or the tab delimited HRC summary (HRC release 1 or 1.1, from the HRC web site).
At the moment the TOPMed panel is not supported; however, this is being worked on.
vcfparse.pl -d <directory of VCFs> -o <outputname> [-g]
where
-d | The path to the directory containing imputed VCFs. |
-o | Specifies the output directory name, will be created if it doesn't exist. |
-g | Flag to specify the output files are gzipped. |
The program will not overwrite files of the same name. This
process must be run for each imputed data set.
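
Because the conversion has to be repeated for each imputed data set, it can be convenient to generate the commands programmatically. The following is a minimal Python sketch and is not part of ic itself. The helper name vcfparse_command and the example directory names are hypothetical, and it assumes vcfparse.pl is run through perl from the current directory. Only the -d, -o and -g options described above are used.

```python
def vcfparse_command(vcf_dir, output_name, gzipped=False, script="vcfparse.pl"):
    """Build the argument list for one vcfparse.pl run (hypothetical helper)."""
    cmd = ["perl", script, "-d", str(vcf_dir), "-o", str(output_name)]
    if gzipped:
        cmd.append("-g")  # -g: output files are gzipped
    return cmd


# Hypothetical data sets: directory of imputed VCFs -> output directory name.
datasets = {
    "imputed/study1": "study1_parsed",
    "imputed/study2": "study2_parsed",
}

commands = [vcfparse_command(d, o, gzipped=True) for d, o in datasets.items()]
for cmd in commands:
    print(" ".join(cmd))
    # To actually run the conversion, uncomment the next two lines:
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems if a directory name contains spaces.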
Once the VCFs are converted the main program can be run with the
following options.
ic -d <directory> -r <Reference panel> [-h | (-g -p <population>)] [-f <mappings file>] [-o <output directory>]
-d --directory | Directory | Top level directory containing either one set of per chromosome .info files, or multiple directories each containing a set of per chromosome .info files. This directory will be searched recursively for files matching the required formats. Files may be gzipped or uncompressed. |
-f --file | Mapping file | Mapping file of directory name to cohort name, optional but recommended when using multiple data sets |
-r --ref | Reference panel | Reference panel summary file, either 1000G or the tab delimited HRC (r1 or r1.1) |
-h --hrc | Flag to indicate Reference panel file is HRC, defaults to HRC if no option is given | |
-g --1000g | Flag to indicate Reference panel file is 1000G | |
-p --pop | Population | Population to check allele frequency against. Applies to 1000G only, defaults to ALL if not supplied. Options available: ALL, EUR, AFR, AMR, SAS, EAS |
-o | Output Directory | Top level directory to contain all the output folders |
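
The relationship between the options above (-h on its own, or -g optionally combined with -p) can be sketched in Python as follows. This is an illustration only and is not part of ic. The helper name ic_command, the script name ic.pl invoked through perl, and the file names in the usage lines are assumptions.

```python
VALID_POPULATIONS = {"ALL", "EUR", "AFR", "AMR", "SAS", "EAS"}


def ic_command(directory, ref_panel, panel="hrc", population=None,
               mapping_file=None, output_dir=None, script="ic.pl"):
    """Build the argument list for an ic run (hypothetical helper)."""
    cmd = ["perl", script, "-d", directory, "-r", ref_panel]
    if panel == "hrc":
        if population is not None:
            raise ValueError("-p applies to the 1000G panel only")
        cmd.append("-h")
    elif panel == "1000g":
        pop = population or "ALL"  # ALL is the documented default for -p
        if pop not in VALID_POPULATIONS:
            raise ValueError(f"unknown population: {pop}")
        cmd += ["-g", "-p", pop]
    else:
        raise ValueError("panel must be 'hrc' or '1000g'")
    if mapping_file:
        cmd += ["-f", mapping_file]
    if output_dir:
        cmd += ["-o", output_dir]
    return cmd


# Hypothetical file and directory names, for illustration only.
print(" ".join(ic_command("imputed", "hrc_summary.tab")))
print(" ".join(ic_command("imputed", "1000g_summary.legend", panel="1000g",
                          population="EUR", mapping_file="mapping.txt",
                          output_dir="ic_output")))
```

Checking the population before launching the program catches a mistyped code early instead of part-way through a long run.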
The mapping file should consist of two columns: the directory name and the cohort name.
Example mapping file:
/full/path/to/folder/1 | MyStudy1 |
./path/to/folder/2 | MyStudy2 |
folder3 | MyStudy3 |
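
A mapping file like the example can be written with a few lines of Python. This is only a sketch: the column separator is not stated here, so the tab delimiter used below is an assumption that should be checked against what ic accepts, and the helper name write_mapping is hypothetical.

```python
import csv
import io


def write_mapping(rows, handle, delimiter="\t"):
    """Write (directory, cohort name) pairs, one pair per line.

    The tab delimiter is an assumption; adjust it to whatever ic expects.
    """
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    for directory, cohort in rows:
        writer.writerow([directory, cohort])


rows = [
    ("/full/path/to/folder/1", "MyStudy1"),
    ("./path/to/folder/2", "MyStudy2"),
    ("folder3", "MyStudy3"),
]

# An in-memory buffer is used here; pass an open file handle to write to disk.
buffer = io.StringIO()
write_mapping(rows, buffer)
mapping_text = buffer.getvalue()
print(mapping_text, end="")
```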
If no mapping file is supplied, the program will attempt to
determine a unique set of names from the top level directory
and/or sub-directories supplied with the -d option. This may or
may not result in a unique folder for each output; if it does not, the
program will auto-increment the file names within the
directory (these will be consistent across each data set).
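
To judge in advance whether a mapping file is worth supplying, the data set directories can be checked for repeated names. The Python sketch below is not part of ic and does not reproduce its naming logic. It assumes, purely for illustration, that a name would be taken from the final component of each path, and the directory names are hypothetical.

```python
import os
from collections import defaultdict


def name_collisions(directories):
    """Group directories by final path component and return the repeats."""
    groups = defaultdict(list)
    for d in directories:
        groups[os.path.basename(os.path.normpath(d))].append(d)
    return {name: paths for name, paths in groups.items() if len(paths) > 1}


# Hypothetical data set directories.
collisions = name_collisions(
    ["cohortA/imputed", "cohortB/imputed", "cohortC/results"]
)
for name, paths in collisions.items():
    print(f"'{name}' is shared by: {', '.join(paths)}")
```

If any collisions are reported, a mapping file gives each data set an explicit cohort name.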
One advantage of using a mapping file is the data sets provided
need not be all in the same base path.
Currently, imputed files from the University of Michigan and the Sanger Institute are supported. Impute is also supported but requires a different reformatter; contact me in this case.
Gzip in
Perl does not support bgzip chunks; however, there are now
workarounds in the code which should work in most cases. Contact
me if you have problems with this.
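
For context, a bgzip file is a series of concatenated gzip members, so a reader that stops after the first member sees only part of the data. The Python sketch below is not part of ic and does not show the workaround used in the Perl code. It only illustrates the file structure, using made-up records and plain gzip members in place of real bgzip blocks.

```python
import gzip

# Two independently compressed gzip members, concatenated in the same way
# bgzip concatenates its blocks (real bgzip blocks also carry an extra
# header field, which is omitted here).
member1 = gzip.compress(b"chr1\t100\n")
member2 = gzip.compress(b"chr1\t200\n")
data = member1 + member2

# gzip.decompress reads through every member, not just the first one.
text = gzip.decompress(data)
print(text.decode(), end="")
```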
If there are errors on the allele frequency
plots, this could be because earlier versions of the University of
Michigan imputation output did not contain the AF in the VCF.
Contact me for an earlier version that can support this.
We are now
planning a container-based version to avoid any issues that the
requirement to install GD might cause.