The BGEN format
A compressed binary format for typed and imputed genotype data
v1.1
This page documents version 1.1 of the BGEN format. A more recent version of this document is available - see the latest BGEN specification.

Introduction

Background

Modern genetic association studies routinely employ data on tens to hundreds of thousands of individuals, genotyped or imputed at tens of millions of markers genome-wide. Traditional data formats based on text representation of these data - such as the GEN format output by IMPUTE, or the Variant Call Format - are sometimes not well suited to these data quantities. For simple programs the time spent parsing these formats can dominate program execution time.

This page describes a binary GEN file format (the "BGEN" format) which aims to address these problems. BGEN is a robust format that has been designed to have a specific blend of features that we believe make it useful for this type of study. It is targeted for use with large, potentially imputed genetic datasets. Key features include:

  • The ability to store both directly typed and imputed data.
  • Small file sizes through the use of an efficient representation of probability data and compression.
  • The use of per-variant compression makes the format simple to index and easy to catalogue.
  • The BGEN format has been used in several major projects, including the Wellcome Trust Case-Control Consortium 2 and the MalariaGEN project. It will be the release format for genome-wide genotype data for the UK Biobank.

    Acknowledgements. The following people contributed to the design and implementation of the BGEN format:

    Tool support

    A free C++ implementation of the BGEN format is provided in the "genfile" sublibrary of QCTOOL, available here.

    Change history

    v1.1 (March 2012):
    A version of the BGEN format designed to cope with the long alleles present at indels and structural variants in recent releases of the 1000 Genomes Project. Features of this version are:
    • Support for biallelic SNPs and indels with alleles of arbitrary length (up to 2^32-1).
    • Storage of probabilities to at least 4 decimal places' worth of accuracy.
    v1.0 (2009):
    The original BGEN format. This version is now deprecated and will be removed from a future version of this spec; there probably aren't any files in the wild in this format.

    Detailed specification

    A BGEN file consists of a header block, followed by a series of blocks called snp blocks. The first four bytes of the file indicate the start position of the first snp block (relative to the fifth byte of the file).

    Note: All numbers in the file are stored as integers in little-endian (least significant byte first) order. This choice coincides with the memory layout used on most common architectures. See the Wikipedia page for more details.

    The first four bytes

    The first four bytes of the file encode an unsigned integer indicating the offset, relative to the 5th byte of the file, of the start of the first snp block (or the end of the file if there are 0 snp blocks). For example, if this offset is 20 (the minimum possible because the header block always has size at least 20) then the snp blocks start at byte 25.

    No. of bytes    Description
    4               An unsigned integer offset indicating the offset, relative to the fifth byte of the file, of the first byte of the first snp block (or the end of the file if there are no snp blocks).
    4               TOTAL
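
As a minimal sketch of reading this field, assuming Python's standard struct module (the helper names read_offset and first_snp_block_position are hypothetical and not part of the specification):

```python
import io
import struct


def read_offset(stream):
    """Read the 4-byte little-endian unsigned offset at the start of a BGEN file."""
    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("file too short to contain the offset field")
    (offset,) = struct.unpack("<I", raw)
    return offset


def first_snp_block_position(offset):
    """Zero-based file position of the first SNP block.

    The offset is measured from the fifth byte of the file, which sits at
    zero-based position 4.
    """
    return 4 + offset


# An offset of 20 places the first SNP block at the 25th byte of the file,
# which is zero-based position 24.
example = io.BytesIO(struct.pack("<I", 20))
offset = read_offset(example)
position = first_snp_block_position(offset)
```

A reader would typically seek to this position once the header block has been parsed.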

    The header block

    The header block contains global information about the file.

    No. of bytes    Description
    4               An unsigned integer H indicating the length, in bytes, of the header block. This must not be larger than offset.
    4               An unsigned integer indicating the number of snp blocks stored in the file.
    4               An unsigned integer indicating the number of samples represented in the snp blocks in the file.
    4               Reserved. (Writers should write 0 here, readers should ignore these bytes.)
    H-20            Free data area. This could be used to store, for example, identifying information about the file.
    4               A set of flags, with bits numbered as for an unsigned integer. See below for flag definitions.
    H               TOTAL
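
The header layout can be illustrated with a short Python sketch. It treats H as the total length of the header block, as the description of the first field states, so the flags occupy the last four of those H bytes. The function name parse_header and the dictionary keys are hypothetical choices made for this example only.

```python
import struct


def parse_header(data):
    """Parse a BGEN v1.1 header block from bytes that start at the 5th byte of the file."""
    if len(data) < 20:
        raise ValueError("a header block is at least 20 bytes long")
    # Four little-endian unsigned 32-bit integers: H, the number of SNP blocks,
    # the number of samples, and the reserved field (ignored by readers).
    header_length, n_snp_blocks, n_samples, _reserved = struct.unpack_from("<4I", data, 0)
    if header_length < 20 or header_length > len(data):
        raise ValueError("header length H is inconsistent with the data supplied")
    # The free data area fills the H-20 bytes between the reserved field and the flags.
    free_data = data[16:header_length - 4]
    (flags,) = struct.unpack_from("<I", data, header_length - 4)
    return {
        "header_length": header_length,
        "n_snp_blocks": n_snp_blocks,
        "n_samples": n_samples,
        "free_data": free_data,
        "flags": flags,
    }


# A made-up header with H = 24: four bytes of free data and a flags value of 5.
example = struct.pack("<4I", 24, 3, 2, 0) + b"test" + struct.pack("<I", 5)
header = parse_header(example)
```

This sketch does not check H against the offset stored in the first four bytes of the file; a stricter reader would also reject H larger than that offset.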

    Header block -- flag definitions

    The following flags can be contained in the flags field in the header block. Note: all bits not listed here must be set to 0.

    Bit   Name                  Value   Description
    0     CompressedSNPBlocks   0       Indicates SNP block probability data is not compressed.
                                1       Indicates SNP block probability data is compressed using zlib's compress() function.
    2     LongIds               0       Indicates alleles are stored as single characters. SNP blocks are laid out according to the v1.0 spec.
                                1       Indicates version 1.1 of the SNP block layout is used. This allows for multiple characters in alleles and is supported in SNPTEST from version 2.3.0, and in QCTOOL version 1.1.
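
One possible way to decode the flags field is sketched below in Python. It assumes that bit 0 is the least significant bit of the unsigned integer; the constant and function names are hypothetical.

```python
# Assumption: bit 0 is the least significant bit of the flags integer.
COMPRESSED_SNP_BLOCKS = 1 << 0  # bit 0
LONG_IDS = 1 << 2               # bit 2


def decode_flags(flags):
    """Split the header flags field into its two defined flags."""
    # All bits not listed in the table must be 0.
    unknown = flags & ~(COMPRESSED_SNP_BLOCKS | LONG_IDS)
    if unknown:
        raise ValueError("undefined flag bits are set: 0x%x" % unknown)
    return {
        "compressed_snp_blocks": bool(flags & COMPRESSED_SNP_BLOCKS),
        "long_ids": bool(flags & LONG_IDS),
    }


# A flags value of 5 has bits 0 and 2 set: compressed data, v1.1 SNP block layout.
decoded = decode_flags(5)
```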

    SNP blocks

    Following the header comes a sequence of 0 or more SNP blocks. Each SNP block consists of the following data in order. (Note: the following description is valid when LongIds=1. When LongIds=0, SNP blocks are laid out as per the v1.0 spec, described here.)

    No. of bytes    Description
    4               The number of individuals the row represents, hereafter denoted N.
    2               The length LS of the SNP id.
    LS              The SNP id.
    2               The length LR of the rsid.
    LR              The rsid.
    2               The length LC of the chromosome.
    LC              The chromosome.
    4               The SNP position, encoded as an unsigned 32-bit integer.
    4               The length LA of the A allele.
    LA              The A allele.
    4               The length LB of the B allele.
    LB              The B allele.
    P               Genotype probability data for the SNP for each of the N individuals in the cohort. If the CompressedSNPBlocks flag is not set, this field consists of P=6*N bytes representing the probabilities. If CompressedSNPBlocks is set, this field contains a 32-bit unsigned integer specifying the length of the compressed data, followed by the compressed data itself. See below for details of the storage scheme used.
    22 + LS + LR + LC + LA + LB + P    TOTAL
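
The fields that precede the probability data could be read along the following lines. This is a sketch only: the helper names are hypothetical, and the identifiers and alleles are returned as raw bytes because the table above does not name a character encoding.

```python
import struct


def _read_prefixed(data, pos, length_format):
    """Read a length-prefixed field; length_format is "<H" (2 bytes) or "<I" (4 bytes)."""
    (length,) = struct.unpack_from(length_format, data, pos)
    pos += struct.calcsize(length_format)
    value = data[pos:pos + length]
    if len(value) != length:
        raise ValueError("SNP block is truncated")
    return value, pos + length


def parse_snp_identifying_data(data, pos=0):
    """Parse the v1.1 (LongIds=1) SNP block fields that precede the probability data.

    Returns the parsed fields and the position at which the probability data starts.
    """
    (n_individuals,) = struct.unpack_from("<I", data, pos)
    pos += 4
    snp_id, pos = _read_prefixed(data, pos, "<H")
    rsid, pos = _read_prefixed(data, pos, "<H")
    chromosome, pos = _read_prefixed(data, pos, "<H")
    (position,) = struct.unpack_from("<I", data, pos)
    pos += 4
    allele_a, pos = _read_prefixed(data, pos, "<I")
    allele_b, pos = _read_prefixed(data, pos, "<I")
    fields = {
        "n_individuals": n_individuals,
        "snp_id": snp_id,
        "rsid": rsid,
        "chromosome": chromosome,
        "position": position,
        "allele_a": allele_a,
        "allele_b": allele_b,
    }
    return fields, pos


# A made-up SNP block prefix for two individuals.
example = (
    struct.pack("<I", 2)
    + struct.pack("<H", 4) + b"SNP1"
    + struct.pack("<H", 3) + b"rs1"
    + struct.pack("<H", 2) + b"01"
    + struct.pack("<I", 1000)
    + struct.pack("<I", 1) + b"A"
    + struct.pack("<I", 2) + b"AT"
)
fields, end = parse_snp_identifying_data(example)
```

In this example the fixed-size fields take 22 bytes and the variable-length fields take 4 + 3 + 2 + 1 + 2 = 12 bytes, so the probability data would begin 34 bytes into the block.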

    SNP block probability data

    The probability data is listed as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of AA, the second the probability of AB, the third the probability of BB. Altogether these occupy 6*N bytes where N is the number of samples. When CompressedSNPBlocks is not set, these 6 * N bytes are stored directly. When CompressedSNPBlocks is set, these 6 * N bytes are first compressed using zlib, and the length of the compressed data is stored as a 4-byte integer, followed by the compressed data itself.
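
The two storage cases described above can be sketched in Python using the standard zlib module, whose decompress() function reads data in the format written by zlib's compress(). The function name read_probability_integers is hypothetical.

```python
import struct
import zlib


def read_probability_integers(data, pos, n_samples, compressed):
    """Return one (AA, AB, BB) triple of stored integers per sample, plus the end position."""
    expected = 6 * n_samples
    if compressed:
        # A 4-byte length precedes the zlib-compressed data.
        (compressed_length,) = struct.unpack_from("<I", data, pos)
        pos += 4
        raw = zlib.decompress(data[pos:pos + compressed_length])
        pos += compressed_length
    else:
        raw = data[pos:pos + expected]
        pos += expected
    if len(raw) != expected:
        raise ValueError("expected %d bytes of probability data, got %d" % (expected, len(raw)))
    # 3 * N little-endian unsigned 16-bit integers, grouped into triples.
    values = struct.unpack("<%dH" % (3 * n_samples), raw)
    triples = [values[i:i + 3] for i in range(0, len(values), 3)]
    return triples, pos


# Made-up probability data for two samples, stored both ways.
raw = struct.pack("<6H", 32768, 0, 0, 0, 16384, 16384)
packed = zlib.compress(raw)
stored_compressed = struct.pack("<I", len(packed)) + packed

plain_triples, plain_end = read_probability_integers(raw, 0, 2, compressed=False)
zipped_triples, zipped_end = read_probability_integers(stored_compressed, 0, 2, compressed=True)
```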

    To convert the stored 2-byte integers into probabilities, the following calculation should be performed:

    1. Convert the number into a floating-point format (e.g. float or double).
    2. Divide by 32,768.

    Note that the range of a two-byte unsigned integer is 0 - 65,535 inclusive. Thus the resulting probabilities can take on values between 0 and 65,535/32,768 ≈ 1.99997 inclusive, in steps of 1/32,768 ≈ 0.00003, so they are accurate to four decimal places.
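
The conversion from stored integers to probabilities amounts to a single division, as in this small Python sketch (the function name to_probability is hypothetical):

```python
def to_probability(stored):
    """Convert a stored 2-byte unsigned integer to a floating-point probability."""
    if not 0 <= stored <= 65535:
        raise ValueError("stored value must fit in a 2-byte unsigned integer")
    return float(stored) / 32768.0


certain = to_probability(32768)  # a stored value of 32768 represents probability 1
half = to_probability(16384)     # a stored value of 16384 represents probability 0.5
```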

    Note: to convert a floating point probability to its integer representation, do the following:

    1. Multiply by 32,768.
    2. Check that the number is in the half-open interval [0,65535.5) and round to the nearest integer.
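
The reverse conversion might look as follows in Python. The steps above do not say how exact halves should be rounded; this sketch assumes they round upwards, and the function name to_stored is hypothetical.

```python
import math


def to_stored(probability):
    """Convert a floating-point probability to its 2-byte integer representation."""
    scaled = probability * 32768.0
    # The scaled value must lie in the half-open interval [0, 65535.5).
    if not 0.0 <= scaled < 65535.5:
        raise ValueError("probability is outside the representable range")
    # Round to the nearest integer (assumption: exact halves round upwards).
    return int(math.floor(scaled + 0.5))


stored_one = to_stored(1.0)
stored_quarter = to_stored(0.25)
```

The range check also rejects NaN, because every comparison involving NaN is false.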

    All numbers are stored in little-endian (least significant byte first) order.