Modern genetic association studies routinely employ data on tens to hundreds of thousands of individuals, genotyped or imputed at tens of millions of markers genome-wide. Traditional data formats based on text representation of these data - such as the GEN format output by IMPUTE, or the Variant Call Format - are sometimes not well suited to these data quantities. For simple programs the time spent parsing these formats can dominate program execution time.
This page describes a binary GEN file format (the "BGEN" format) which aims to address these problems. BGEN is a robust format that has been designed to have a specific blend of features that we believe make it useful for this type of study. It is targeted for use with large, potentially imputed genetic datasets. Key features include:
The BGEN format has been used in several major projects, including the Wellcome Trust Case-Control Consortium 2 and the MalariaGEN project. It will be the release format for genome-wide genotype data for the UK Biobank.
A free C++ implementation of the BGEN format is provided in the "genfile" sublibrary of QCTOOL.
A BGEN file consists of a header block, followed by a series of blocks called snp blocks. The first four bytes of the file indicate the start position of the first snp block (relative to the fifth byte of the file).
Note: All numbers in the file are stored as integers in little-endian (least significant byte first) order. This choice coincides with the memory layout used on most common architectures. See the Wikipedia page on endianness for more details.
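As a small illustration of the byte order, the sketch below decodes a four-byte little-endian unsigned integer in Python in two equivalent ways. The byte values are made up for the example and are not taken from any real BGEN file.

```python
import struct

# Four bytes holding the value 20, least significant byte first.
raw = bytes([0x14, 0x00, 0x00, 0x00])

# Decode with int.from_bytes ...
value = int.from_bytes(raw, byteorder="little")

# ... or with struct, where "<" selects little-endian and "I" a 4-byte unsigned integer.
(same_value,) = struct.unpack("<I", raw)
```

Reading the same bytes as big-endian would instead give 0x14000000, which is why a reader must fix the byte order explicitly rather than rely on the host machine.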
The first four bytes of the file encode an unsigned integer indicating the offset, relative to the 5th byte of the file, of the start of the first snp block (or the end of the file if there are 0 snp blocks). For example, if this offset is 20 (the minimum possible because the header block always has size at least 20) then the snp blocks start at byte 25.
|No. of bytes||Description|
|4||An unsigned integer offset indicating the offset, relative to the fifth byte of the file, of the first byte of the first snp block (or the end of the file if there are no snp blocks).|
The header block contains global information about the file.
|No. of bytes||Description|
|4||An unsigned integer H indicating the length, in bytes, of the header block. This must not be larger than offset.|
|4||An unsigned integer indicating the number of snp blocks stored in the file.|
|4||An unsigned integer indicating the number of samples represented in the snp blocks in the file.|
|4||Reserved. (Writers should write 0 here, readers should ignore these bytes.)|
|H-20||Free data area. This could be used to store, for example, identifying information about the file|
|4||A set of flags, with bits numbered as for an unsigned integer. See below for flag definitions.|
|H||TOTAL|
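A minimal sketch of reading the offset and the header block in Python, assuming the whole file (or at least its first 4 + H bytes) is held in a bytes object. The names `BgenHeader` and `parse_bgen_header` are hypothetical and are not part of any existing library. The sketch relies on the field sizes listed above summing to H: 16 bytes of fixed fields, H-20 bytes of free data and 4 bytes of flags.

```python
import struct
from typing import NamedTuple


class BgenHeader(NamedTuple):
    """Hypothetical container for the offset and header block fields."""
    offset: int
    header_length: int
    num_snp_blocks: int
    num_samples: int
    free_data: bytes
    flags: int


def parse_bgen_header(data: bytes) -> BgenHeader:
    # First four bytes: offset of the first SNP block, relative to byte index 4.
    (offset,) = struct.unpack_from("<I", data, 0)
    # Header block: H, number of SNP blocks, number of samples, reserved.
    header_length, num_snp_blocks, num_samples, _reserved = struct.unpack_from(
        "<IIII", data, 4
    )
    if header_length < 20 or header_length > offset:
        raise ValueError("invalid header length %d" % header_length)
    if len(data) < 4 + header_length:
        raise ValueError("data is shorter than the header block")
    # Free data area occupies H-20 bytes, followed by the 4-byte flags field.
    free_length = header_length - 20
    free_data = data[20:20 + free_length]
    (flags,) = struct.unpack_from("<I", data, 20 + free_length)
    return BgenHeader(offset, header_length, num_snp_blocks,
                      num_samples, free_data, flags)


# Usage on a synthetic header with 4 bytes of free data (H = 24).
example = (struct.pack("<I", 24)
           + struct.pack("<IIII", 24, 0, 3, 0)
           + b"demo"
           + struct.pack("<I", 5))
header = parse_bgen_header(example)
```

The first SNP block, if any, would then begin at byte index `4 + header.offset` of the data.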
The following flags can be contained in the flags field in the header block. Note: all bits not listed here must be set to 0.
|Bit||Name||Value||Description|
|0||CompressedSNPBlocks||0||Indicates SNP block probability data is not compressed.|
|0||CompressedSNPBlocks||1||Indicates SNP block probability data is compressed using zlib's compress() function.|
|2||LongIds||0||Indicates alleles are stored as single characters. SNP blocks are laid out according to the v1.0 spec.|
|2||LongIds||1||Indicates version 1.1 of the SNP block layout is used. This allows for multiple characters in alleles and is supported in SNPTEST from version 2.3.0, and in QCTOOL version 1.1.|
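The flags can be tested with bit masks. The sketch below assumes that "bits numbered as for an unsigned integer" means bit 0 is the least significant bit; the constant and function names are hypothetical.

```python
# Assumption: bit 0 is the least significant bit of the flags field.
COMPRESSED_SNP_BLOCKS = 1 << 0
LONG_IDS = 1 << 2


def decode_flags(flags: int) -> dict:
    """Return the two documented flags; reject any other set bit."""
    known = COMPRESSED_SNP_BLOCKS | LONG_IDS
    if flags & ~known:
        raise ValueError("unexpected flag bits set: %#x" % (flags & ~known))
    return {
        "CompressedSNPBlocks": bool(flags & COMPRESSED_SNP_BLOCKS),
        "LongIds": bool(flags & LONG_IDS),
    }


decoded = decode_flags(5)  # binary 101: both documented flags set
```

Rejecting unknown bits is a design choice for this sketch, based on the requirement that unlisted bits be 0; a more lenient reader might ignore them instead.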
Following the header comes a sequence of 0 or more SNP blocks. Each SNP block consists of the following data in order. (Note: the following description is valid when LongIds=1. When LongIds=0, SNP blocks are laid out as per the v1.0 spec, described here.)
|No. of bytes||Description|
|4||The number of individuals the row represents, hereafter denoted N.|
|2||The length LS of the SNP id.|
|LS||The SNP id.|
|2||The length LR of the rsid.|
|LR||The rsid.|
|2||The length LC of the chromosome.|
|LC||The chromosome.|
|4||The SNP position, encoded as an unsigned 32-bit integer.|
|4||The length LA of the A allele.|
|LA||The A allele.|
|4||The length LB of the B allele.|
|LB||The B allele.|
|P||Genotype probability data for the SNP for each of the N individuals in the cohort. If the CompressedSNPBlocks flag is not set, this field consists of P=6*N bytes representing the probabilities. If CompressedSNPBlocks is set, this field contains a 32-bit unsigned integer specifying the length of the compressed data, followed by the compressed data itself. See below for details of the storage scheme used.|
|22 + LS + LR + LC + LA + LB + P||TOTAL|
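A minimal sketch of reading the identifying fields of one SNP block in the LongIds=1 layout, up to but not including the probability data. It assumes the chromosome is stored as LC bytes directly after its length, in the same way as the SNP id and rsid. The function names are hypothetical, and the strings are returned as raw bytes because the text does not state a character encoding.

```python
import struct


def _read_field(data: bytes, pos: int, length_format: str):
    """Read a length-prefixed byte string; return (value, new position)."""
    (length,) = struct.unpack_from(length_format, data, pos)
    pos += struct.calcsize(length_format)
    value = data[pos:pos + length]
    if len(value) != length:
        raise ValueError("truncated SNP block")
    return value, pos + length


def parse_snp_block_ids(data: bytes, pos: int = 0):
    """Parse one SNP block's identifying data; return (fields, new position)."""
    (n,) = struct.unpack_from("<I", data, pos)          # number of individuals N
    pos += 4
    snp_id, pos = _read_field(data, pos, "<H")          # 2-byte length LS
    rsid, pos = _read_field(data, pos, "<H")            # 2-byte length LR
    chromosome, pos = _read_field(data, pos, "<H")      # 2-byte length LC
    (position,) = struct.unpack_from("<I", data, pos)   # SNP position
    pos += 4
    allele_a, pos = _read_field(data, pos, "<I")        # 4-byte length LA
    allele_b, pos = _read_field(data, pos, "<I")        # 4-byte length LB
    fields = {
        "N": n, "snp_id": snp_id, "rsid": rsid, "chromosome": chromosome,
        "position": position, "allele_a": allele_a, "allele_b": allele_b,
    }
    return fields, pos


# Usage on a synthetic block (values invented for illustration).
block = (struct.pack("<I", 2)
         + struct.pack("<H", 4) + b"SNP1"
         + struct.pack("<H", 3) + b"rs1"
         + struct.pack("<H", 2) + b"01"
         + struct.pack("<I", 1000)
         + struct.pack("<I", 1) + b"A"
         + struct.pack("<I", 1) + b"G")
fields, next_pos = parse_snp_block_ids(block)
```

The returned position points at the first byte of the genotype probability data.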
The probability data is listed as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of AA, the second the probability of AB, the third the probability of BB. Altogether these occupy 6*N bytes where N is the number of samples. When CompressedSNPBlocks is not set, these 6 * N bytes are stored directly. When CompressedSNPBlocks is set, these 6 * N bytes are first compressed using zlib, and the length of the compressed data is stored as a 4-byte integer, followed by the compressed data itself.
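The storage scheme just described can be sketched as follows, using Python's standard zlib module to undo the compression. The function name and arguments are hypothetical; `n` is the N read from the SNP block and `compressed` is the state of the CompressedSNPBlocks flag.

```python
import struct
import zlib


def read_probability_integers(data: bytes, pos: int, n: int, compressed: bool):
    """Return (list of (AA, AB, BB) integer triples, new position)."""
    if compressed:
        # 4-byte length of the compressed data, then the compressed data itself.
        (length,) = struct.unpack_from("<I", data, pos)
        pos += 4
        raw = zlib.decompress(data[pos:pos + length])
        pos += length
    else:
        raw = data[pos:pos + 6 * n]
        pos += 6 * n
    if len(raw) != 6 * n:
        raise ValueError("expected %d bytes of probability data" % (6 * n))
    # 3*N little-endian 2-byte unsigned integers, grouped into triples.
    values = struct.unpack("<%dH" % (3 * n), raw)
    triples = [values[i:i + 3] for i in range(0, 3 * n, 3)]
    return triples, pos


# Usage on synthetic data for two samples, stored both ways.
stored = struct.pack("<6H", 32768, 0, 0, 0, 16384, 16384)
plain_triples, plain_end = read_probability_integers(stored, 0, 2, False)

payload = zlib.compress(stored)
packed = struct.pack("<I", len(payload)) + payload
zipped_triples, zipped_end = read_probability_integers(packed, 0, 2, True)
```

Checking the decompressed length against 6*N is a cheap way to detect a corrupt block or a wrong flag setting.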
To convert the stored 2-byte integers into probabilities, the following calculation should be performed: convert the integer to a floating-point number and divide it by 32768.
Note that the range of a two-byte unsigned integer is 0 - 65,535 inclusive. Thus the resulting probabilities can take on values between 0 and 65,535/32768 ~ 1.9999 inclusive and they are accurate to four decimal places.
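A minimal sketch of this conversion, using the 32768 scale factor that the range quoted above implies. The function name is hypothetical.

```python
def stored_to_probability(stored: int) -> float:
    """Convert a stored 2-byte unsigned integer to a probability."""
    if not 0 <= stored <= 0xFFFF:
        raise ValueError("value does not fit in two bytes: %r" % stored)
    return stored / 32768.0


certain = stored_to_probability(32768)
half = stored_to_probability(16384)
largest = stored_to_probability(65535)
```

Because 32768 is a power of two, the division is exact in binary floating point, so no rounding error is introduced at this step.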
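Note: to convert a floating point probability to its integer representation, do the following: multiply the probability by 32768, check that the result would round to a value in the range 0 to 65,535, and round it to the nearest integer.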
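A sketch of the reverse conversion, assuming it is the inverse of the 32768 scaling used for decoding: multiply by 32768, check the range, and round to the nearest integer. The function name and the round-half-up choice are assumptions of this example.

```python
def probability_to_stored(probability: float) -> int:
    """Convert a probability to its 2-byte unsigned integer representation."""
    scaled = probability * 32768.0
    # Anything in [0, 65535.5) rounds to a value that fits in two bytes.
    if not 0.0 <= scaled < 65535.5:
        raise ValueError("probability out of representable range: %r" % probability)
    # Round half up; Python's built-in round() would round halves to even.
    return int(scaled + 0.5)


one = probability_to_stored(1.0)
half = probability_to_stored(0.5)
third = probability_to_stored(0.3333)
```

Values produced this way can be written with `struct.pack("<H", value)` to obtain the two bytes in the required order.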
All numbers are stored in little-endian (least significant byte first) order.