McCarthy Group Tools

This page provides links and information on tools the group has developed

email: wrayner at well dot ox dot ac dot uk

Tools on this page






ScatterShot
A simple to use cluster plotting program.
The ASHG 2013 poster describing this can be downloaded here:
WR-ASHG2013posterPP-portrait.pdf

Current Version (13Mb)
ScatterShot.2014JAN17.zip

Previous Versions

ScatterShot.2013DEC09.zip (13Mb)
ScatterShot.2013NOV25.r1.zip (12.9Mb)
ScatterShot_2013OCT25_r1.zip (16.4Mb)
ScatterShot_2013OCT24_r2.zip (15.5Mb)
ScatterShot.zip  (12.9Mb)



Strand and Position files for Common Genotyping Chips
The web page for the strand data can be found here Strand data

The ASHG 2011 Poster describing this can be downloaded here:
WR-ASHG2011posterPP-portrait.pdf



AB Alleles to Top Strand

If your genotypes are represented as A/B then the files present on the link below to access files that allow you to update the alleles from A/B to TOP strand A, C, G, T calls.
Page can be found here



Plot Tool
Will be available soon email wrayner at well dot ox dot ac dot uk for latest details



Physical Activity Program
Program to derive Kcal values from questionnaire data
Download here:
PhysicalActivityProgram.zip



HRC or 1000G Imputation preparation and checking
Program to check a QC'd plink .bim file against the HRC, 1000G or CAAPA reference SNP list in advance of imputation

Version 4.2.1

HRC-1000G-check-bim-v4.2.zip

Version 4.2.2
HRC-1000G-check-bim.zip

Version 4.2.3
HRC-1000G-check-bim.zip

Version 4.2.4
HRC-1000G-check-bim.v4.2.4.zip

Version 4.2.5
HRC-1000G-check-bim.v4.2.5.zip

Version 4.2.6
HRC-1000G-check-bim-v4.2.6.zip

Version 4.2.7
HRC-1000G-check-bim-v4.2.7.zip

Changes V4.2.1 to V4.2.2
As of v4.2.2 new two new options have been added for allele frequency thresholds (-t <difference>, -n)
-t 0.2 sets allele difference threshold to 0.2.
 Use this to change the allele frequency difference used to exclude SNPs in the final file, range 0 - 1, the larger the difference the fewer SNPs will be excluded.
Use -n flag to specify that you do not wish to exclude any SNPs based on allele frequency difference, if -n is used -t has no effect.

Changes V4.2.2 to V4.2.3
Fixed bug whereby SNPs that were incorrectly mapped in the bim file were updated only for position and not chromosome.

Changes V4.2.3 to V4.2.4
Added support for X chromosome in HRC release v1.1

Changes V4.2.4 to V4.2.5
Fixed bug with Chromosome X being left out of plink command file

Changes V4.2.5 to V4.2.6
Added the ability to use gzipped reference panels
Implemented a check to ensure the same number of variants are present in the .bim and .frq files

Changes V4.2.6 to V4.2.7
Minor update to the usage information display

Checks: strand, alleles, position, Ref/Alt assignments and frequency differences.
In addition to the reference file v4 and above require the plink .bim and (from the plink --freq command) .frq files.
Produces: set of plink commands to update or remove SNPs (see below and changes to V4.2.2) based on the checks
as well as a file (FreqPlot) of cohort allele frequency vs reference panel allele frequency.

Updates: strand, position, ref/alt assignment
Removes: A/T & G/C SNPs if MAF > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference (can be removed/changed in V4.2.2), SNPs not in reference panel


Usage with HRC reference panel:
Requires the unzipped tab delimited HRC reference (currently v1.1 HRC.r1-1.GRCh37.wgs.mac5.sites.tab) from the Haplotype Reference Consortium Website here:
 http://www.haplotype-reference-consortium.org/site
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -h


Usage with 1000G reference panel
Requires the unzipped 1000G legend file (instructions to create this are below) or download here (1.4G in size):
1000GP_Phase3_combined.legend.gz

Usage:
perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -g -p <population>
1000G population will default to ALL if not specified.

Usage with CAAPA reference panel
The CAAPA reference panel can be downloaded here (513MB): all.caapa.sorted.zip
Many thanks to Kathleen Barnes and Michelle Daya of CAAPA for sharing this and also to Margaret Parker and Michael Cho for the initial reformatting
Usage is the same as for the HRC panel, namely using the flags -h -r all.caapa.sorted.txt
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r all.caapa.sorted.txt -h



1000G reference panel file
The reference file for 1000G does not exist as one file, some steps are necessary to create it
you will need the legend files from the 1000GP_phase3.tgz file on the impute web site
https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html
Download the .tgz and extract all .legend files, place them in a directory together with the following script
concatenate-1000GP.zip
and run the script (perl concatenate-1000GP.pl)
this will create a file (1000GP_Phase3_combined.legend) suitable for use with the checking program

Original script
The original script for HRC only checking can still be downloaded here, but I recommend using one V4.2.1 or above
as there are more features and a number of issues that are corrected in the most recent versions:
HRC-check-bim.zip


Post Imputation Checking

Instructions for the download and use of the post Imputation Checking program ic can be found on this page.



UkBioBank Phenotype Extraction Script
Perl script to extract a subset of data from the UKBioBank phenotype data download.
Requires the Stata .dct and R .tab files to run.
Run without any arguments to get a description of usage.

Download:
Version 1.0
 extract.zip


Version 1.2.2
extract-v1.2.2.zip



Phenotype File Summary and Fingerprinting
This program was developed to give an overview of phenotypes (columns) in a tab delimited text file, to do this it attempts to determine the type of each column (numeric, text, mixed text/numeric or coded variables) and summarise it appropriately.

Download
v3.1
phenotype_qc.zip

v4.1
Updated to better accommodate the missing column names in the dct file.
phenotype_qc.v4.1.zip


Usage:

Version 3.1
phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h]

Version 4.1
Version 4.1 introduces a new method for extracting the data from the phenotype file, this is necessary as for large data sets (>5GB) where v3.1 would not perform well or at all. This version works on data sets of any size however it will only work in a Unix/Linux environment (Cygwin possible but not tested) as it requires the UNIX cut command to extract the data. The option -k specifies how many columns at a time to extract using cut, the default is 25 which is a trade off between the number of calls to cut vs the amount of memory required. If you have problems with memory usage set this a smaller number. Higher numbers are possible but on test sets to date the time to complete a run increases, probably due to memory allocation.
The second new option (-p) is to plot all the numeric phenotypes, this allows a quick visual check of the distribution of values. Running this version, with or without plotting will still require GD::Graph to be installed, for instructions on how to do this see here: GD::Graph

Command line option for v4.1:
phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h -p -k <#columns per chunk>]



-f <phenotypeFile> A tab delimited phenotype file, all rows should contain the same number of columns.
-i <sampleIDcolumn> Set the column that contains the sample ids, column numbers start from 0, default is 0.
-m <missingValueIdentifier>  Set the missing value identifier, default is NA.
-c <column> Specify a single column to check, if using a header file this can be either the human readable or original header.
-s <columnFile> Provide a file containing a list of columns to extract, one per row. As with the single column checking can be either header, takes precedence over -c.
-l <headerLookup> Provide an optional header file for the data set, for use where the headers within the file are not human readable, requires two tab delimited columns, the first column should contain the ids that match the header in phenotypeFile.
-h
Show the help message.

Version 4.1 Only
-k <#columns per chunk> Specify the number of columns to extract in each chunk for processing, default if not specified is 25.
-p
Specify to plot every column identified as numeric, or numeric coded, default is not to plot.


The categories reported by the program are:
  • Numeric: this category reports
    • Total number of non-missing variables.
    • Total number of missing entries.
    • Mean.
    • Median.
    • Standard deviation.
    • Number of entries that are either greater or less than three standard deviations from the mean.
  • Text, Mixed and Coded these categories all report:
    • Total number of non-missing variables.
    • Total number of missing entries.
    • Total number of unique entries, and if the number of unique entries is below 50 these are displayed with counts for each entry.
  • Coded:
    • Entries in this category can be text coded, numeric coded or mixed coded.
    • A column is described as containing a coded set of variables if the total number of unique values is less than 0.1% of the total number of entries, or 20, whichever is the greater.
The latest versions (v3.1 & v4.1) also produce a "fingerprint" summary of each row (assumed sample) and column (assumed phenotype) these are designed to verify whether a sample or phenotype has changed between data releases.
The fingerprint files (prefixed with FP, fingerprint phenotype, and FPS, fingerprint Sample) both contain the following columns:
  • Sample ID or column header.
  • Total numeric values.
  • Total non-numeric values.
  • Total non missing values.
  • Total unique values.
  • Total missing values.
  • Determined type.
These files can be used to determine whether a phenotype has changed between data releases, a script to do this comparison is under development and will be posted as soon as it is finalised.


Excel to Tab

A Perl wrapper using Spreadsheet::Read to convert various spreadsheet formats (xls, xlsx, csv, ods) to tab delimited text files.

Where the format supports sheets the output file name will include the sheet name and there will be one file per sheet.
Depending on your system you may need to add one or more of the following modules to your Perl install:
 Spreadsheet::Read
Spreadsheet::ReadSXC
Spreadsheet::ParseExcel
Spreadsheet::ParseXLSX
Text::CSV_XS


Download
xls2tab.zip

Usage: xls2tab -i <inputfile> [-o <outputfilestem> -f]

-i
inputfile
Path and name of the input file
-o
outputfilestem
Optional, if supplied will form the stem for all the output. For example using MyFile will lead to MyFile.Sheet1.txt, MyFile.Sheet2.txt (assuming there are 2 sheets named Sheet1 and Sheet2). If not supplied the name will be the same as the input, with the sheet name, if it exists, and the file extension, .txt.
-f

Optional, tells the program to select the formatted values from the cells, not the raw values. Default is for raw values