The BGEN format
A compressed binary format for typed and imputed genotype data
v1.1
This page documents version 1.1 of the BGEN format. A more recent version of this specification is available - see here for details.

Detailed specification

A BGEN file consists of a header block, followed by a series of blocks called SNP blocks. The first four bytes of the file indicate the start position of the first SNP block (relative to the fifth byte of the file).

Note: All numbers in the file are stored as integers in little-endian (least significant byte first) order. This choice coincides with the memory layout used on most common architectures. See the Wikipedia page for more details.

The first four bytes

The first four bytes of the file encode an unsigned integer indicating the offset, relative to the fifth byte of the file, of the start of the first SNP block (or the end of the file if there are no SNP blocks). For example, if this offset is 20 (the minimum possible, because the header block always has size at least 20), then the SNP blocks start at the 25th byte of the file (byte index 24 when counting from zero).

No. of bytes    Description
4               An unsigned integer giving the offset, relative to the fifth byte of the file, of the first byte of the first SNP block (or the end of the file if there are no SNP blocks).
4               TOTAL
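
As an illustration of the offset field, here is a minimal Python sketch that reads it and converts it to a zero-based position in the file. The function name first_snp_block_position is hypothetical and not part of the specification.

```python
import struct

def first_snp_block_position(data: bytes) -> int:
    """Return the zero-based byte index at which the first SNP block starts."""
    # "<I" means: little-endian, unsigned 32-bit integer.
    (offset,) = struct.unpack_from("<I", data, 0)
    # The offset is counted from the fifth byte of the file, i.e. index 4.
    return 4 + offset

# A file holding only the offset field and a minimal 20-byte header.
example = struct.pack("<I", 20) + bytes(20)
print(first_snp_block_position(example))  # 24, i.e. the 25th byte
```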

The header block

The header block contains global information about the file.

No. of bytes    Description
4               An unsigned integer H indicating the length, in bytes, of the header block. This must not be larger than the offset stored in the first four bytes.
4               An unsigned integer indicating the number of SNP blocks stored in the file.
4               An unsigned integer indicating the number of samples represented in the SNP blocks in the file.
4               Reserved. (Writers should write 0 here; readers should ignore these bytes.)
H-20            Free data area. This could be used to store, for example, identifying information about the file.
4               A set of flags, with bits numbered as for an unsigned integer. See below for flag definitions.
H               TOTAL
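
The following Python sketch shows one way the header fields might be read. It assumes that the listed fields together occupy exactly H bytes (4 + 4 + 4 + 4 + (H-20) + 4), so that the flags sit in the last four bytes of the header block. The function name parse_header and the dictionary keys are hypothetical.

```python
import struct

def parse_header(data: bytes) -> dict:
    """Parse the offset field and the header block from the start of a file."""
    # Bytes 0-3: offset; 4-7: H; 8-11: number of SNP blocks; 12-15: samples.
    offset, header_len, n_blocks, n_samples = struct.unpack_from("<4I", data, 0)
    if header_len < 20 or header_len > offset:
        raise ValueError("header length must satisfy 20 <= H <= offset")
    # Bytes 16-19 are reserved and ignored by readers.
    # The free data area occupies H-20 bytes, from file index 20 up to H.
    free_data = data[20:header_len]
    # Assumption: the flags are the final four bytes of the header block,
    # which starts at file index 4 and is H bytes long.
    (flags,) = struct.unpack_from("<I", data, header_len)
    return {
        "offset": offset,
        "header_length": header_len,
        "n_snp_blocks": n_blocks,
        "n_samples": n_samples,
        "free_data": free_data,
        "flags": flags,
    }
```

With a minimal header (H = 20) the free data area is empty and the flags occupy file bytes 20 to 23.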

Header block -- flag definitions

The following flags can be contained in the flags field in the header block. Note: all bits not listed here must be set to 0.

Bit    Name                   Value    Description
0      CompressedSNPBlocks    0        Indicates SNP block probability data is not compressed.
0      CompressedSNPBlocks    1        Indicates SNP block probability data is compressed using zlib's compress() function.
2      LongIds                0        Indicates alleles are stored as single characters. SNP blocks are laid out according to the v1.0 spec.
2      LongIds                1        Indicates version 1.1 of the SNP block layout is used. This allows for multiple characters in alleles and is supported in SNPTEST from version 2.3.0, and in QCTOOL version 1.1.
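
A short Python sketch of how the flags field might be decoded. It assumes that bit 0 is the least significant bit of the integer, and it chooses to reject files in which any unlisted bit is set; the specification only says such bits must be 0, so rejecting is a design choice of this sketch. The names used are hypothetical.

```python
COMPRESSED_SNP_BLOCKS = 1 << 0  # bit 0
LONG_IDS = 1 << 2               # bit 2

def decode_flags(flags: int) -> dict:
    """Split the header flags field into its two defined flags."""
    known = COMPRESSED_SNP_BLOCKS | LONG_IDS
    if flags & ~known:
        # All bits not listed in the flag definitions must be 0.
        raise ValueError("unknown flag bits set: %#x" % (flags & ~known))
    return {
        "compressed_snp_blocks": bool(flags & COMPRESSED_SNP_BLOCKS),
        "long_ids": bool(flags & LONG_IDS),
    }

print(decode_flags(5))  # both flags set
```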

SNP blocks

Following the header comes a sequence of 0 or more SNP blocks. Each SNP block consists of the following data in order. (Note: the following description is valid when LongIds=1. When LongIds=0, SNP blocks are laid out as per the v1.0 spec, described here.)

No. of bytes    Description
4               The number of individuals the row represents, hereafter denoted N.
2               The length LS of the SNP id.
LS              The SNP id.
2               The length LR of the rsid.
LR              The rsid.
2               The length LC of the chromosome.
LC              The chromosome.
4               The SNP position, encoded as an unsigned 32-bit integer.
4               The length LA of the A allele.
LA              The A allele.
4               The length LB of the B allele.
LB              The B allele.
P               Genotype probability data for the SNP for each of the N individuals in the cohort. If the CompressedSNPBlocks flag is not set, this field consists of P=6*N bytes representing the probabilities. If CompressedSNPBlocks is set, this field contains a 32-bit unsigned integer specifying the length of the compressed data, followed by the compressed data itself. See below for details of the storage scheme used.
22 + LS + LR + LC + LA + LB + P    TOTAL
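
The field order above can be sketched in Python as follows, for the LongIds=1 layout only. The function name read_snp_block and the dictionary keys are hypothetical. Identifiers and alleles are returned as raw bytes because this page does not state a character encoding. The sketch uses Python's zlib.decompress to undo zlib's compress().

```python
import struct
import zlib

def read_snp_block(data: bytes, pos: int, compressed: bool):
    """Read one SNP block starting at index pos; return (fields, next_pos)."""

    def read_uint(fmt: str, size: int) -> int:
        nonlocal pos
        (value,) = struct.unpack_from(fmt, data, pos)
        pos += size
        return value

    def read_bytes(length: int) -> bytes:
        nonlocal pos
        chunk = data[pos:pos + length]
        if len(chunk) != length:
            raise ValueError("truncated SNP block")
        pos += length
        return chunk

    n = read_uint("<I", 4)                       # number of individuals N
    snp_id = read_bytes(read_uint("<H", 2))      # LS, then the SNP id
    rsid = read_bytes(read_uint("<H", 2))        # LR, then the rsid
    chromosome = read_bytes(read_uint("<H", 2))  # LC, then the chromosome
    position = read_uint("<I", 4)                # SNP position
    allele_a = read_bytes(read_uint("<I", 4))    # LA, then the A allele
    allele_b = read_bytes(read_uint("<I", 4))    # LB, then the B allele
    if compressed:
        # 4-byte compressed length, followed by the zlib-compressed data.
        raw = zlib.decompress(read_bytes(read_uint("<I", 4)))
    else:
        raw = read_bytes(6 * n)
    if len(raw) != 6 * n:
        raise ValueError("expected 6*N bytes of probability data")
    fields = {
        "n": n, "snp_id": snp_id, "rsid": rsid, "chromosome": chromosome,
        "position": position, "allele_a": allele_a, "allele_b": allele_b,
        "probability_bytes": raw,
    }
    return fields, pos
```

Returning the position of the next byte lets a caller loop over consecutive SNP blocks by feeding next_pos back in as pos.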

SNP block probability data

The probability data is listed as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of AA, the second the probability of AB, the third the probability of BB. Altogether these occupy 6*N bytes where N is the number of samples. When CompressedSNPBlocks is not set, these 6 * N bytes are stored directly. When CompressedSNPBlocks is set, these 6 * N bytes are first compressed using zlib, and the length of the compressed data is stored as a 4-byte integer, followed by the compressed data itself.

To convert the stored 2-byte integers into probabilities, the following calculation should be performed:

  1. Convert the number into a floating-point format (e.g. float or double).
  2. Divide by 32,768.

Note that the range of a two-byte unsigned integer is 0 to 65,535 inclusive. Thus the resulting probabilities take values between 0 and 65,535/32,768 ≈ 1.99997 inclusive, in steps of 1/32,768 ≈ 0.00003, so they are accurate to four decimal places.
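
The conversion above can be sketched in Python as follows, taking the 6*N uncompressed bytes and returning one (AA, AB, BB) triple per sample. The function name decode_probabilities is hypothetical.

```python
import struct

def decode_probabilities(raw: bytes) -> list:
    """Convert 6*N bytes of stored data into N (AA, AB, BB) probability triples."""
    if len(raw) % 6 != 0:
        raise ValueError("probability data must be a multiple of 6 bytes")
    # Each value is a little-endian 2-byte unsigned integer.
    values = struct.unpack("<%dH" % (len(raw) // 2), raw)
    # Convert to floating point and divide by 32,768.
    probs = [v / 32768.0 for v in values]
    return [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]

raw = struct.pack("<3H", 32768, 0, 0)
print(decode_probabilities(raw))  # [(1.0, 0.0, 0.0)]
```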

Note: to convert a floating point probability to its integer representation, do the following:

  1. Multiply by 32,768.
  2. Check that the number is in the half-open interval [0,65535.5) and round to the nearest integer.
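
A minimal Python sketch of these two steps. The text does not say how exact ties should be rounded, so rounding halves upward here is an assumption, and the function name encode_probability is hypothetical.

```python
import math

def encode_probability(p: float) -> int:
    """Convert a floating-point probability to its 2-byte integer representation."""
    scaled = p * 32768
    # The scaled value must lie in the half-open interval [0, 65535.5).
    if not (0 <= scaled < 65535.5):
        raise ValueError("probability out of representable range: %r" % p)
    # Round to the nearest integer (ties rounded up; an assumption).
    return math.floor(scaled + 0.5)

print(encode_probability(0.5))  # 16384
```

The result always fits in a two-byte unsigned integer, because any value below 65535.5 rounds to at most 65,535.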

All numbers are stored in little-endian (least significant byte first) order.