ic, a post-imputation data checking program

Background

ic is a set of programs that produces a single-page HTML visual summary of one or more imputed data sets from the most common imputation programs. The poster from the ASHG 2016 meeting that describes this program, and the pre-imputation checking, can be downloaded here (3.8MB).

Download:

The latest versions (v1.0.8 onwards) require only the ic.pl program. Earlier versions required an additional program (vcfparse) to do the initial parsing of the VCF files. The programs are available to download here:

v1.0.2
ic.v1.0.2.zip
v1.0.3
ic.v1.0.3.zip
v1.0.4
ic.v1.0.4.zip
v1.0.5
ic.v1.0.5.zip
v1.0.6
ic.v1.0.6.zip
v1.0.7
ic.v1.0.7.zip
v1.0.8
ic.v1.0.8.zip
v1.0.9
ic.v1.0.9.zip

Update History:

v1.0.2 Updated path to Java executable
v1.0.3 Updated to run using the info files from Michigan Imputation server
v1.0.4 Fixed bug with the 1000G parsing
v1.0.5 Added ability to calculate AF from AC and AN for Umich data
v1.0.6 Added function to read summary level data
v1.0.7 Testing to speed up processing of summary level data; added a function to create blank plots if a chromosome is missing; fixed a bug in the info score summary plot
v1.0.8 Updated file reading to cope with bgzipped files, removing the need for vcfparse; added better handling of paths to ensure the Java executable is found at run time
v1.0.9 Added gzip reference panel reading

Installation

No installation of the program is required; extract all the files in the zip file to a directory. There are, however, a number of dependencies. See Requirements below for how to install them; the instructions assume you have root access to the system you are working on.

Requirements:

The program requires the GD libraries to be installed to create the plots.

Install libGD

Installation will vary by system. On Ubuntu, libGD can be installed system-wide with the following commands:

Before 16.04:
sudo apt-get -y install libgd2-xpm-dev build-essential
16.04 onwards:
sudo apt-get -y install libgd-dev build-essential

Install Perl GD::Graph

Install cpanm and then GD::Graph:
sudo cpan App::cpanminus
sudo cpanm GD::Graph

Download the reference panel:

Either the 1000G phase 3 summary (from here, 1.44GB) or the tab-delimited HRC summary (HRC release 1 or 1.1 from the HRC web site).
At the moment the TOPMed panel is not supported; however, this is being worked on.

Usage

Versions v1.0.8 and later can use the .info files directly from the Michigan Imputation server and so do not require the vcfparse script. Versions before v1.0.8, or the use of Sanger imputation files, require the first 8 columns to be extracted from the VCFs; use the vcfparse.pl script and the following instructions to extract them.

vcfparse usage:

vcfparse.pl -d <directory of VCFs> -o <outputname> [-g]
where

-d
The path to the directory containing imputed VCFs.
-o
Specifies the output directory name; the directory will be created if it doesn't exist.
-g
Flag to specify the output files are gzipped.

The program will not overwrite existing files of the same name. This process must be repeated for each imputed data set.
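
To illustrate what "the first 8 columns" means, here is a minimal Python sketch that trims VCF lines to the eight fixed VCF fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is not the vcfparse.pl implementation; the function name first_eight_columns and the choice to pass "##" header lines through unchanged are assumptions made for illustration only.

```python
def first_eight_columns(lines):
    """Yield each VCF line cut down to its first 8 tab-separated columns.

    Assumption (not taken from vcfparse.pl): '##' meta-information lines
    are passed through unchanged and empty lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information lines have no columns to trim.
            yield line
        else:
            # The '#CHROM' header line and data lines share the same layout.
            yield "\t".join(line.split("\t")[:8])
```

For a gzipped VCF the lines could be read with gzip.open(path, "rt") and passed straight to this function.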
Once the VCFs are converted, the main program can be run with the following options.

ic Usage:

ic -d <directory> -r <Reference panel> [-h | (-g -p <population>)] [-f <mappings file>] [-o <output directory>]

Options:

-d --directory <Directory>
Top level directory containing either one set of per-chromosome .info files, or multiple directories each containing a set of per-chromosome .info files. This directory will be searched recursively for files matching the required formats. Files may be gzipped or uncompressed.
-f --file <Mapping file>
Mapping file of directory name to cohort name; optional but recommended when using multiple data sets.
-r --ref <Reference panel>
Reference panel summary file, either 1000G or the tab-delimited HRC (r1 or r1.1).
-h --hrc
Flag to indicate the reference panel file is HRC; defaults to HRC if no option is given.
-g --1000g
Flag to indicate the reference panel file is 1000G.
-p --pop <Population>
Population to check allele frequency against. Applies to 1000G only; defaults to ALL if not supplied. Options available: ALL, EUR, AFR, AMR, SAS, EAS.
-o <Output directory>
Top level directory to contain all the output folders.
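
As a rough illustration of how the options above combine, the following Python sketch assembles an argument list for one run. Only the flags documented above are used; the helper name build_ic_command, the "perl ic.pl" invocation and all paths are assumptions for illustration, not part of the program.

```python
# Populations listed in the options above (1000G only).
POPULATIONS = {"ALL", "EUR", "AFR", "AMR", "SAS", "EAS"}


def build_ic_command(directory, ref, panel="hrc", population=None,
                     mapping_file=None, output_dir=None, program="ic.pl"):
    """Return an argument list for one ic run (hypothetical helper).

    'program' and the use of 'perl' to launch it are assumptions.
    """
    cmd = ["perl", program, "-d", directory, "-r", ref]
    if panel == "hrc":
        if population is not None:
            raise ValueError("-p applies to the 1000G panel only")
        cmd.append("-h")
    elif panel == "1000g":
        cmd.append("-g")
        if population is not None:
            if population not in POPULATIONS:
                raise ValueError(f"unknown population: {population}")
            cmd += ["-p", population]
    else:
        raise ValueError("panel must be 'hrc' or '1000g'")
    if mapping_file:
        cmd += ["-f", mapping_file]
    if output_dir:
        cmd += ["-o", output_dir]
    return cmd
```

The resulting list could then be handed to subprocess.run(cmd, check=True); building a list rather than a single string avoids shell quoting problems with paths that contain spaces.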

Mapping file

The mapping file should consist of two columns:

The directory name (optionally including the path)
The name you wish to use for the output files

Example mapping file:

/full/path/to/folder/1 MyStudy1
./path/to/folder/2 MyStudy2
folder3 MyStudy3
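
A mapping file in this layout can be sanity-checked before a run. The Python sketch below assumes the two columns are separated by whitespace and that neither column contains spaces, as in the example above; the function name parse_mapping is hypothetical and this is not how ic itself reads the file.

```python
def parse_mapping(text):
    """Parse two-column mapping text into {directory: output name}.

    Assumptions: columns are whitespace-separated and contain no spaces;
    blank lines are ignored; output names must be unique.
    """
    mapping = {}
    seen_names = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 columns, found {len(parts)}")
        directory, name = parts
        if name in seen_names:
            raise ValueError(f"line {lineno}: duplicate name {name!r}")
        seen_names.add(name)
        mapping[directory] = name
    return mapping
```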



If no mapping file is supplied, the program will attempt to determine a unique set of names from the top level directory and/or sub-directories supplied with the -d option. This may or may not result in unique folders for each output; if it does not, the program will auto-increment the file names within the directory (these will be consistent across each data set).
One advantage of using a mapping file is that the data sets provided need not all be in the same base path.
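
To avoid relying on automatically derived names, a mapping file can be generated up front. The Python sketch below walks a top level directory, collects every directory holding .info or .info.gz files, and proposes one name per directory. The naming scheme (directory base name, with _2, _3 and so on appended on clashes), the single-space separator and both function names are assumptions for illustration; they do not reproduce the naming logic inside ic.

```python
import os


def find_info_dirs(top):
    """Return sorted directories under 'top' holding .info or .info.gz files."""
    found = []
    for root, _dirs, files in os.walk(top):
        if any(name.endswith((".info", ".info.gz")) for name in files):
            found.append(root)
    return sorted(found)


def mapping_lines(dirs):
    """Build 'directory name' lines, numbering repeated base names."""
    counts = {}
    lines = []
    for directory in dirs:
        base = os.path.basename(os.path.normpath(directory))
        counts[base] = counts.get(base, 0) + 1
        # First use keeps the plain name; later clashes get _2, _3, ...
        name = base if counts[base] == 1 else f"{base}_{counts[base]}"
        lines.append(f"{directory} {name}")
    return lines
```

Writing "\n".join(mapping_lines(find_info_dirs(top))) to a file gives a starting point that can be edited by hand before being passed with -f.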


Formats Supported

Currently, imputed files from the University of Michigan and the Sanger Institute are supported. Impute is also supported but requires a different reformatter; contact me in this case.


Output

An example of the html output can be found here: Sample QC Report (5.2MB).

Known issues

Gzip in Perl does not support bgzip chunks; however, there are now workarounds in the code that should work in most cases. Contact me if you have problems with this.
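
When diagnosing a file that fails to read, it can help to know whether it is plain gzip or bgzip. The Python sketch below inspects the gzip header for the BGZF extra subfield (identifier bytes "B" and "C" with a 2-byte payload), as defined in the SAM/BAM specification. The function names are hypothetical and this check is not part of ic.

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF (bgzip) block header."""
    # gzip magic bytes plus the deflate compression method.
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08":
        return False
    # FLG bit 2 (FEXTRA) must be set for an extra field to be present.
    if not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if si1 == 0x42 and si2 == 0x43 and slen == 2:
            return True
        pos += 4 + slen
    return False


def file_is_bgzf(path):
    """Check the first bytes of a file on disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return is_bgzf(handle.read(1024))
```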
If there are errors on the allele frequency plots, this may be because earlier versions of the University of Michigan imputation output did not contain the AF in the VCF. Contact me for an earlier version that can support this.
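
Where AF is missing but the allele count (AC) and allele number (AN) are present, the frequency can be recovered as AF = AC / AN, which is the calculation mentioned in the v1.0.5 update. The Python sketch below shows this for a single VCF INFO string. It assumes biallelic records (one AC value) and uses a hypothetical function name; it is not the code used by ic.

```python
def allele_frequency(info_field):
    """Return AF from a VCF INFO string, falling back to AC / AN.

    Assumes a biallelic record (single AF or AC value). Returns None
    when the frequency cannot be determined.
    """
    fields = {}
    for item in info_field.split(";"):
        # Flags without '=' (for example 'TYPED') get an empty value.
        key, _sep, value = item.partition("=")
        fields[key] = value
    if fields.get("AF"):
        return float(fields["AF"])
    if fields.get("AC") and fields.get("AN"):
        allele_number = int(fields["AN"])
        if allele_number == 0:
            return None
        return int(fields["AC"]) / allele_number
    return None
```

For example, an INFO value of "AC=5;AN=20" gives 5 / 20 = 0.25.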

Future plans

We are now planning a container-based version to avoid any issues that the requirement to install GD might cause.