QCTOOL v2

Sample file formats

SNPTEST sample file format

QCTOOL uses the same sample file format as SNPTEST. This is a space-separated file with two header lines followed by data, as follows.

Header line

This line lists the name of each column. Column names may contain any printable characters except whitespace.

Column type line

This line lists the 'type' of each column and is used by QCTOOL to process data in an appropriate way. There must be one type for each column. Allowable column types are:

0 - for the first (identifier) column.
D - for a column containing discrete values, e.g. a set of strings.
P or C - for a column containing continuous values. Each value must be numerical or a missing value.
B - for a column containing a binary trait. The values in this column must be '0', '1', 'control', or 'case'.

In addition to the first column, QCTOOL optionally allows there to be a column called 'ID_2' and/or a column called 'missing' that have type '0'. Internally these are treated like columns of type 'D'.
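
The rules for the column type line can be sketched as a small check. This is an illustrative Python sketch, not QCTOOL's own code: the function name check_column_types is hypothetical, and it assumes that type '0' is rejected for every column other than the first, 'ID_2' and 'missing', which the text above implies but does not state outright.

```python
ALLOWED_TYPES = {"0", "D", "P", "C", "B"}
# Columns other than the first that may carry type '0' (per the text above).
EXTRA_ID_COLUMNS = {"ID_2", "missing"}


def check_column_types(names, types):
    """Return a list of problems found in the two header lines (empty if none)."""
    problems = []
    if len(names) != len(types):
        problems.append(f"{len(names)} column names but {len(types)} column types")
        return problems
    if not types or types[0] != "0":
        problems.append("first column must have type '0'")
    for position, (name, col_type) in enumerate(zip(names, types)):
        if col_type not in ALLOWED_TYPES:
            problems.append(f"column '{name}' has unknown type '{col_type}'")
        elif col_type == "0" and position > 0 and name not in EXTRA_ID_COLUMNS:
            # Assumption of this sketch: '0' is reserved for identifier-like columns.
            problems.append(f"column '{name}' may not have type '0'")
    return problems


print(check_column_types("ID sex case covariate".split(), "0 D B C".split()))
```

An empty list means the two header lines are consistent with the rules as read here.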

Data lines

Each data line holds the data for one sample. There should be one data line per sample in the genotypes file, and the lines must be in the same order as the samples appear in the genotypes file.

By default 'NA' is used for missing values in the sample file. Any value in any column that is equal (as a string literal) to "NA" will be treated as missing. (The option -missing-code can be used to alter what is treated as a missing value.)
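
As an illustration of the missing-value rule, the following Python sketch converts a single field according to its column type. The function parse_value and its missing_code parameter are hypothetical names chosen for this example (the parameter only mimics the effect of the -missing-code option), and the mapping of 'control' to 0 and 'case' to 1 is an assumption of the sketch rather than documented behaviour.

```python
def parse_value(raw, col_type, missing_code="NA"):
    """Convert one field to a Python value; None stands for a missing value."""
    if raw == missing_code:  # exact string comparison, as described above
        return None
    if col_type in ("P", "C"):
        return float(raw)  # raises ValueError for non-numerical input
    if col_type == "B":
        # Assumed encoding: control -> 0, case -> 1.
        binary = {"0": 0, "control": 0, "1": 1, "case": 1}
        if raw not in binary:
            raise ValueError(f"invalid binary value: {raw!r}")
        return binary[raw]
    return raw  # types '0' and 'D' are kept as strings


print(parse_value("NA", "C"))
print(parse_value("case", "B"))
```

Because the comparison is on the literal string, a value such as "na" in a discrete column is kept as an ordinary value in this sketch.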

The first column in a sample file must always be of type '0' and it is always treated as sample identifiers. We strongly recommend that the identifiers used are chosen to be unique across a project.

Relative to the traditional format used in SNPTEST, QCTOOL allows some modifications:

The only mandatory column is the first column, which must be of type '0' and contains sample identifiers.

An example of a sample file is:


ID sex case covariate
0 D B C
sample1 M control 0.1
sample2 F control 0.2
sample3 F control -0.15
sample4 F case -0.01
sample5 NA NA 0.025

(etc.)
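
For readers who want to inspect such a file outside QCTOOL, a minimal Python sketch is given below. It uses the first rows of the example above, splits the text into the header line, the column type line and the data lines, and reports sample identifiers that occur more than once. The names read_sample_text and duplicated_ids are hypothetical, and skipping blank lines is a choice made for this sketch, not something the format description specifies.

```python
from collections import Counter

EXAMPLE = """\
ID sex case covariate
0 D B C
sample1 M control 0.1
sample2 F control 0.2
sample3 F control -0.15
"""


def read_sample_text(text):
    """Split sample-file text into column names, column types and data rows."""
    # Blank lines are skipped (an assumption of this sketch).
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("a sample file needs a header line and a column type line")
    names, types, rows = lines[0], lines[1], lines[2:]
    for number, row in enumerate(rows, start=1):
        if len(row) != len(names):
            raise ValueError(
                f"data line {number} has {len(row)} fields, expected {len(names)}"
            )
    return names, types, rows


def duplicated_ids(rows):
    """Return sample identifiers (first column) that occur more than once."""
    counts = Counter(row[0] for row in rows)
    return [sample_id for sample_id, count in counts.items() if count > 1]


names, types, rows = read_sample_text(EXAMPLE)
print(names)
print(types)
print(duplicated_ids(rows))
```

A non-empty result from duplicated_ids would point to identifiers that are not unique, which the recommendation above advises against.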