The BGEN format
A compressed binary format for typed and imputed genotype data
v1.2
This page documents version 1.2 of the BGEN format. A more recent version of this specification is available - see here for details.

Overview

A BGEN file consists of a header block, giving general information about the file, and an optional sample identifier block. These are followed by a series of variant data blocks, stored consecutively in the file, which each contain data for a single genetic variant. To allow for potential future additions to the spec, the first variant data block is located using an offset stored in the first four bytes of the file.

The format in which variant data blocks are stored is determined by a set of flag bits stored in the header block. Currently two formats are supported - Layout 1 blocks which are a direct translation to binary of the GEN format; and Layout 2 blocks, which are both more space-efficient and more flexible, including support for genotype and haplotype data, multi-allelic variants, and non-diploid samples. An older format, used in the v1.0 spec, is now deprecated and is no longer documented in this spec.

Data types

All numbers in a BGEN file are stored as unsigned integers in little-endian (least significant byte first) order. This choice coincides with the memory layout used on most common architectures - see the Wikipedia page for more details.

Variant identifiers, chromosome identifiers, and other string fields are stored as a two- or four-byte integer length followed by the data itself (which does not include a C-style trailing zero byte).

Genotype probabilities are stored in an efficient packed bit representation described in detail below.

Finally, some fields in BGEN are interpreted as flags encoded as a bitmask.

The first four bytes

The first four bytes of the file encode an unsigned integer indicating the offset, relative to the 5th byte of the file, of the start of the first variant data block, or the end of the file if there are 0 variant data blocks. For example, if this offset is 20 (the minimum possible because the header block always has size at least 20) then the variant data blocks start at the 25th byte of the file (zero-based position 24).

No. of bytes Description
4 An unsigned integer offset indicating the offset, relative to the fifth byte of the file, of the first byte of the first variant data block (or the end of the file if there are no variant data blocks).
4 TOTAL
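
As an illustration of the offset rule above, the following minimal Python sketch converts the stored offset into a zero-based file position. The function name read_first_variant_offset is hypothetical and not part of the spec.

```python
import struct

def read_first_variant_offset(data: bytes) -> int:
    """Return the zero-based position of the first variant data block.

    `data` holds at least the first four bytes of a BGEN file.
    """
    if len(data) < 4:
        raise ValueError("a BGEN file must contain at least four bytes")
    # "<I" = little-endian unsigned 32-bit integer.
    (offset,) = struct.unpack_from("<I", data, 0)
    # The offset is relative to the fifth byte, i.e. zero-based position 4.
    return 4 + offset
```

With the minimum offset of 20 this returns 24, which is the 25th byte of the file.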

The header block

The header block contains global information about the file, including the number of samples and the number of variant data blocks the file contains, and flags indicating how data is stored.

No. of bytes Description
4 An unsigned integer LH indicating the length, in bytes, of the header block. This must not be larger than offset.
4 An unsigned integer M indicating the number of variant data blocks stored in the file.
4 An unsigned integer N indicating the number of samples represented in the variant data blocks in the file.
4 'Magic number' bytes. This field should contain the four bytes 'b', 'g', 'e', 'n'. For backwards compatibility, readers should also accept the value 0 (four zero bytes) here.
LH-20 Free data area. This could be used to store, for example, identifying information about the file
4 A set of flags, with bits numbered as for an unsigned integer. See below for flag definitions.
LH TOTAL
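
The header layout above could be parsed roughly as follows. This is a sketch only: the Header type, its field names, and the parse_header function are illustrative assumptions, and the default start position of 4 assumes the header directly follows the four offset bytes.

```python
import struct
from typing import NamedTuple

class Header(NamedTuple):
    length: int       # LH
    n_variants: int   # M
    n_samples: int    # N
    free_data: bytes  # free data area (LH-20 bytes)
    flags: int        # raw flags field

def parse_header(data: bytes, start: int = 4) -> Header:
    """Parse a header block beginning at zero-based position `start`."""
    lh, m, n = struct.unpack_from("<III", data, start)
    if lh < 20:
        raise ValueError("header block must be at least 20 bytes long")
    magic = data[start + 12:start + 16]
    # Readers should accept either 'bgen' or four zero bytes.
    if magic not in (b"bgen", b"\x00\x00\x00\x00"):
        raise ValueError("unrecognised magic number")
    # The free data area sits between the magic number and the flags.
    free_data = data[start + 16:start + lh - 4]
    (flags,) = struct.unpack_from("<I", data, start + lh - 4)
    return Header(lh, m, n, free_data, flags)
```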

Header block -- flag definitions

The following flags can be contained in the flags field in the header block. Note: bits and field values not specified here are reserved for possible future use; they should be set to zero.

Bit Name Value Description
0-1 CompressedSNPBlocks 0 Indicates SNP block probability data is not compressed.
0-1 CompressedSNPBlocks 1 Indicates SNP block probability data is compressed using zlib's compress() function.
2-5 Layout (previously called LongIds) 0 Indicates SNP blocks are laid out according to Layout 0, first used in the v1.0 spec. This allows only single-character alleles. Use of this format is deprecated, in the sense that it should not be used for new files. We will remove this from a future version of the spec.
2-5 Layout 1 Indicates SNP blocks are laid out according to Layout 1, i.e. as in the v1.1 spec. This allows for multiple characters in alleles and is supported in SNPTEST from version 2.3.0, and in QCTOOL from version 1.1.
2-5 Layout 2 Indicates SNP blocks are laid out according to Layout 2, introduced in version 1.2 of the spec (i.e. in this document). This format supports multiple alleles, phased and unphased genotypes, explicit specification of ploidy and missing data, and configurable levels of compression. It is recommended that all new files are stored with Layout=2.
2-5 Layout >2 Reserved for future use.
31 SampleIdentifiers 0 Indicates sample identifiers are not stored in this file.
31 SampleIdentifiers 1 Indicates a sample identifier block follows the header. It is recommended that all new files are created with SampleIdentifiers=1.
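
The bit positions in the table above translate into shifts and masks. A minimal sketch follows; the function name decode_flags and the dictionary keys are illustrative choices, not part of the spec.

```python
def decode_flags(flags: int) -> dict:
    """Split the 32-bit flags field into the three documented fields."""
    return {
        "CompressedSNPBlocks": flags & 0x3,        # bits 0-1
        "Layout": (flags >> 2) & 0xF,              # bits 2-5
        "SampleIdentifiers": (flags >> 31) & 0x1,  # bit 31
    }
```

For example, a flags value of 0x80000009 would decode to CompressedSNPBlocks=1, Layout=2 and SampleIdentifiers=1.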

Sample identifier block

If SampleIdentifiers=1 in the flags field, the header block is immediately followed by a sample identifier block. This stores a single identifier per sample.

Note: BGEN treats sample identifiers as a string of bytes, and does not impose any additional restrictions. However, for the simplest interoperability with other software (e.g. for R's make.names) it is often sensible to restrict to ASCII alphanumeric characters, underscores, and full stops.

No. of bytes Description
4 An unsigned integer LSI indicating the length in bytes of the sample identifier block. This must satisfy the constraint LSI + LH ≤ offset.
4 An unsigned integer N indicating the number of samples represented in the file. This must be the same as the number N in the header block.
2 An unsigned integer indicating the length Ls1 of the identifier of sample 1.
Ls1 Identifier of sample 1.
2 An unsigned integer indicating the length Ls2 of the identifier of sample 2.
Ls2 Identifier of sample 2.
...
2 An unsigned integer indicating the length LsN of the identifier of sample N.
LsN Identifier of sample N.
LSI = 8 + 2×N + ∑_n Lsn TOTAL
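
A sample identifier block laid out as above could be read along these lines. This is a sketch under stated assumptions: parse_sample_block and its parameters are hypothetical names, and identifiers are returned as raw bytes because the spec treats them as byte strings.

```python
import struct

def parse_sample_block(data: bytes, start: int, expected_n: int) -> list:
    """Return the sample identifiers stored in the block at `start`."""
    lsi, n = struct.unpack_from("<II", data, start)
    if n != expected_n:
        raise ValueError("sample count differs from the header block")
    pos = start + 8
    identifiers = []
    for _ in range(n):
        # Each identifier is a two-byte length followed by that many bytes.
        (length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        identifiers.append(data[pos:pos + length])
        pos += length
    # LSI = 8 + 2*N + sum of identifier lengths.
    if pos - start != lsi:
        raise ValueError("block length LSI does not match its contents")
    return identifiers
```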

Variant data blocks

Following the header comes a sequence of M variant data blocks (where M is the number specified in the header block). This document describes SNP blocks for spec versions 1.1 and above. Version 1.0 is deprecated and should not be used in new files.

Each variant data block consists of a section of identifying data (containing variant IDs, position, and alleles), followed by a section containing the genotype probability data itself. Most files will have CompressedSNPBlocks=1, indicating that genotype probability data is stored compressed. (The variant identifying data is never compressed, however.)

Variant identifying data

No. of bytes Description
4 The number of individuals the row represents, hereafter denoted N. This is only present if Layout=1 (otherwise it appears instead in the genotype probability block below).
2 The length Lid of the variant identifier. (The variant identifier is intended to store e.g. chip manufacturer IDs for assayed SNPs).
Lid The variant identifier.
2 The length Lrsid of the rsid.
Lrsid The rsid.
2 The length Lchr of the chromosome
Lchr The chromosome
4 The variant position, encoded as an unsigned 32-bit integer.
2 The number K of alleles, encoded as an unsigned 16-bit integer. If Layout=1, this field is omitted, and assumed to equal 2.
4 The length La1 of the first allele.
La1 The first allele.
4 The length La2 of the second allele.
La2 The second allele.
... ...(possibly more alleles)...
4 The length LaK of the Kth allele.
LaK The Kth allele.
16 + 4K + Lid + Lrsid + Lchr + ∑_k Lak + D TOTAL
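
The field order above, including the Layout-dependent fields, might be read as in the following sketch. The names parse_variant_id_data and _read_string and the dictionary keys are illustrative assumptions; strings are kept as raw bytes.

```python
import struct

def _read_string(data: bytes, pos: int, length_format: str):
    """Read a length-prefixed string; return (bytes, new position)."""
    (length,) = struct.unpack_from(length_format, data, pos)
    pos += struct.calcsize(length_format)
    return data[pos:pos + length], pos + length

def parse_variant_id_data(data: bytes, pos: int, layout: int):
    """Return (variant dict, position of the genotype data block)."""
    variant = {}
    if layout == 1:
        # N is stored here only for Layout 1.
        (variant["n_samples"],) = struct.unpack_from("<I", data, pos)
        pos += 4
    variant["id"], pos = _read_string(data, pos, "<H")
    variant["rsid"], pos = _read_string(data, pos, "<H")
    variant["chromosome"], pos = _read_string(data, pos, "<H")
    (variant["position"],) = struct.unpack_from("<I", data, pos)
    pos += 4
    if layout == 1:
        k = 2  # the allele count field is omitted and assumed to be 2
    else:
        (k,) = struct.unpack_from("<H", data, pos)
        pos += 2
    alleles = []
    for _ in range(k):
        # Allele lengths are four-byte integers.
        allele, pos = _read_string(data, pos, "<I")
        alleles.append(allele)
    variant["alleles"] = alleles
    return variant, pos
```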

Genotype data block (Layout 1)

Layout 1 blocks are used when Layout=1. Only two alleles (K=2) are supported. All samples are stored as if diploid; haploid samples should be stored as if having homozygous genotype. Missing samples are encoded as three zero probabilities. This is a direct translation to binary format of a GEN file.

No. of bytes Description
4 The total length C of the compressed genotype probability data for this variant. Seeking forward this many bytes takes you to the next variant data block. If CompressedSNPBlocks=0 this field is omitted and the length of the uncompressed data is C=6N.
C Genotype probability data for the SNP for each of the N individuals in the cohort in the format described below. If CompressedSNPBlocks=0 this consists of C=6N bytes in the format described below. Otherwise this is C bytes which can be uncompressed using zlib to form 6N bytes stored in the format described below. (Zstandard compression, encoded by the value CompressedSNPBlocks = 2, is not supported for v1.1 style blocks.)
C or C+4 TOTAL
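
A minimal sketch of how a reader might recover the 6N uncompressed bytes of a Layout 1 block, using Python's standard zlib module. The function name and parameters are hypothetical.

```python
import struct
import zlib

def read_layout1_probability_bytes(data: bytes, pos: int,
                                   n_samples: int, compressed: bool):
    """Return (6N uncompressed bytes, position of the next variant block)."""
    if compressed:
        # A four-byte length C precedes the compressed data.
        (c,) = struct.unpack_from("<I", data, pos)
        pos += 4
        raw = zlib.decompress(data[pos:pos + c])
    else:
        # No length field is stored; the data is exactly 6N bytes.
        c = 6 * n_samples
        raw = data[pos:pos + c]
    if len(raw) != 6 * n_samples:
        raise ValueError("expected 6N bytes of probability data")
    return raw, pos + c
```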

Probability data storage

For Layout 1 blocks, probability data is stored as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of a homozygous 'AA' allele, the second the probability of 'AB', the third the probability of 'BB', where A and B are the two alleles at the variant. When CompressedSNPBlocks is not set, these 6 * N bytes are stored in the file directly. When CompressedSNPBlocks>0, these 6*N bytes are first compressed using zlib and the length of the compressed data is stored as the 4-byte integer C, followed by the compressed data itself.

To convert the stored 2-byte integers into probabilities, the following calculation should be performed:

  1. Convert the number into a floating-point format (e.g. float or double).
  2. Divide by 32,768.

Note that the range of a two-byte unsigned integer is 0 - 65,535 inclusive. Thus the resulting probabilities can take on values between 0 and 65,535/32768 ~ 1.9999 inclusive and they are accurate to four decimal places.

To convert a floating point probability to its integer representation, do the following:

  1. Multiply by 32,768.
  2. Check that the number is in the half-open interval [0,65535.5) and round to the nearest integer.

All numbers are stored in little-endian (least significant byte first) order. Probabilities for samples with missing genotype data should be stored as zero.
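
The two conversions above can be sketched as follows. The function names are illustrative, and rounding halves upward is an assumption, since the text only asks for rounding to the nearest integer.

```python
import math
import struct

def decode_layout1(raw: bytes) -> list:
    """Turn 6N uncompressed bytes into N (AA, AB, BB) probability triples."""
    if len(raw) % 6 != 0:
        raise ValueError("expected a multiple of 6 bytes")
    # Little-endian two-byte unsigned integers.
    values = struct.unpack(f"<{len(raw) // 2}H", raw)
    probs = [v / 32768.0 for v in values]
    return [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]

def encode_layout1_value(p: float) -> int:
    """Convert one probability to its two-byte integer representation."""
    scaled = p * 32768.0
    if not 0.0 <= scaled < 65535.5:
        raise ValueError("value cannot be represented in two bytes")
    # Round to nearest (ties rounded up; an assumption of this sketch).
    return math.floor(scaled + 0.5)
```

A sample with missing data decodes to the triple (0.0, 0.0, 0.0).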

Genotype data block (Layout 2)

Layout 2 blocks are used when Layout=2. This format supports arbitrary numbers of alleles (up to 65535), samples of arbitrary ploidy (up to 63), and both phased and unphased data.

No. of bytes Description
4 The total length C of the rest of the data for this variant. Seeking forward this many bytes takes you to the next variant data block.
4 The total length D of the probability data after uncompression. If CompressedSNPBlocks = 0, this field is omitted and the total length of the probability data is D=C.
C or C-4 Genotype probability data for the SNP for each of the N individuals in the cohort. If CompressedSNPBlocks = 0, this is D bytes stored in the format described below. If CompressedSNPBlocks = 1, this is C-4 bytes which can be uncompressed using zlib to form D bytes in the format described below.
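
The relationship between C, D and the compression flag can be illustrated with the sketch below. The function name is hypothetical, and only the values 0 and 1 of CompressedSNPBlocks are handled.

```python
import struct
import zlib

def read_layout2_probability_bytes(data: bytes, pos: int, compression: int):
    """Return (D uncompressed bytes, position of the next variant block)."""
    (c,) = struct.unpack_from("<I", data, pos)
    pos += 4
    end = pos + c  # seeking C bytes past the length field reaches the next block
    if compression == 0:
        # The D field is omitted and D = C.
        raw = data[pos:end]
    elif compression == 1:
        (d,) = struct.unpack_from("<I", data, pos)
        # The remaining C-4 bytes are zlib-compressed.
        raw = zlib.decompress(data[pos + 4:end])
        if len(raw) != d:
            raise ValueError("uncompressed length does not match D")
    else:
        raise ValueError("compression value not handled by this sketch")
    return raw, end
```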

Probability data storage

Layout 2 probability data storage is structured as described below. If CompressedSNPBlocks = 0 the structure is stored directly, and C reflects the length of this structure. If CompressedSNPBlocks > 0 the whole structure is stored after compression. In this case D reflects the length of the uncompressed structure and the length of the compressed structure is C-4.

No. of bytes Description
4 The number of individuals for which probability data is stored. This must equal N as defined in the header block.
2 The number of alleles, encoded as an unsigned 16-bit integer. This must equal K as defined in the variant identifying data block.
1 The minimum ploidy Pmin of samples in the row. Values between 0 and 63 are allowed.
1 The maximum ploidy Pmax of samples in the row. Values between 0 and 63 are allowed.
N A list of N bytes, where the nth byte is an unsigned integer representing the ploidy and missingness of the nth sample. Ploidy (possible values 0-63) is encoded in the least significant 6 bits of this value. Missingness is encoded by the most significant bit; thus a value of 1 for the most significant bit indicates that no probability data is stored for this sample.
(Note: there is no way to indicate that the ploidy itself is missing.)
1 Flag, denoted Phased indicating what is stored in the row.
If Phased=1 the row stores one probability per allele (other than the last allele) per haplotype (e.g. to represent phased data).
If Phased=0 the row stores one probability per possible genotype (other than the 'last' genotype where all alleles are the last allele), to represent unphased data.
Any other value for Phased is an error.
1 Unsigned integer B representing the number of bits used to store each probability in this row. This must be between 1 and 32 inclusive.
X Probabilities for each possible haplotype (if Phased=1) or genotype (if Phased=0) for the samples. Each probability is stored in B bits. Values are interpreted by linear interpolation between 0 and 1, i.e. value b corresponds to probability b / (2^B-1). When storing the value, probabilities should be rounded according to the algorithm described below. Probabilities are stored consecutively for samples 1, 2, ..., N. For each sample the order of stored probabilities is described below. Probabilities for samples with missing data (as defined by the missingness/ploidy byte) are written as zeroes (note this represents a change from the earlier draft of this spec; see the rationale below).
D = 10 + N + ∑_i P_i TOTAL
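
The fixed-size fields and the ploidy/missingness bytes of the structure above might be unpacked as in this sketch. The function name and dictionary keys are illustrative assumptions, and the packed probabilities themselves are left untouched.

```python
import struct

def parse_layout2_fields(raw: bytes) -> dict:
    """Parse everything in the uncompressed structure before the probabilities."""
    # 4-byte N, 2-byte K, 1-byte Pmin, 1-byte Pmax (8 bytes, no padding).
    n, k, pmin, pmax = struct.unpack_from("<IHBB", raw, 0)
    ploidy_bytes = raw[8:8 + n]
    if len(ploidy_bytes) != n:
        raise ValueError("truncated ploidy/missingness list")
    ploidy = [b & 0x3F for b in ploidy_bytes]         # least significant 6 bits
    missing = [bool(b & 0x80) for b in ploidy_bytes]  # most significant bit
    phased, bits = struct.unpack_from("<BB", raw, 8 + n)
    if phased not in (0, 1):
        raise ValueError("Phased must be 0 or 1")
    if not 1 <= bits <= 32:
        raise ValueError("B must be between 1 and 32 inclusive")
    return {
        "n_samples": n, "n_alleles": k, "min_ploidy": pmin, "max_ploidy": pmax,
        "ploidy": ploidy, "missing": missing, "phased": phased, "bits": bits,
        "data_offset": 10 + n,  # where the packed probabilities begin
    }
```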

Per-sample order of stored probabilities

Consider a sample with ploidy Z and a variant with K alleles.

  • For phased data, probabilities are stored in the order of haplotypes and then alleles, ie:
    P11, P12, ..., P1(K-1), P21, ..., P2(K-1), ..., PZ1, ..., PZ(K-1).
    where Pij is the probability that haplotype i has allele j. For each haplotype i the probability of the Kth allele (PiK) is not stored; instead it is inferred as one minus the sum of other probabilities for that haplotype. Thus a total of Z(K-1) probabilities are stored.
  • For unphased data, enumerate the possible genotypes as the set of K-vectors of nonnegative integers (x1, x2, ..., xK), where xi represents the count of the i-th allele in the genotype. Probabilities are stored in colex order of these vectors. The last probability (corresponding to the K-th allele homozygotes) is not stored; instead it is inferred as one minus the sum of other probabilities. Thus a total of ( Z+K-1 ) choose ( K-1 )-1 probabilities is stored.

    Example. For example if Z=3 and K=3 then the enumerated genotypes with allele count representations are:

    Index Genotype Allele counts
    0 111 (3,0,0)
    1 112 (2,1,0)
    2 122 (1,2,0)
    3 222 (0,3,0)
    4 113 (2,0,1)
    5 123 (1,1,1)
    6 223 (0,2,1)
    7 133 (1,0,2)
    8 233 (0,1,2)
    9 333 (0,0,3)

    The stored probabilities are thus

    P111, P112, P122, P222, P113, P123, P223, P133, P233

    with P333 inferred as one minus the sum of the other probabilities.

    The colex order has the important property that, for each i, the genotypes carrying the i-th allele appear later in the order than those that carry only alleles 1,...,i-1. See the rationale below for a further discussion of this choice of storage order.
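
The ordering and the counts of stored probabilities described above can be reproduced with a short Python sketch. The function names are hypothetical, and brute-force enumeration followed by a sort is chosen for clarity rather than speed.

```python
from itertools import combinations_with_replacement
from math import comb

def colex_genotypes(ploidy: int, n_alleles: int) -> list:
    """List allele-count vectors (x1, ..., xK) in colex order."""
    genotypes = []
    for alleles in combinations_with_replacement(range(n_alleles), ploidy):
        counts = tuple(alleles.count(a) for a in range(n_alleles))
        genotypes.append(counts)
    # Colex order compares the last coordinate first, then the one before, etc.
    genotypes.sort(key=lambda counts: counts[::-1])
    return genotypes

def n_stored_probabilities(ploidy: int, n_alleles: int, phased: bool) -> int:
    """Number of probabilities stored for one sample."""
    if phased:
        return ploidy * (n_alleles - 1)
    return comb(ploidy + n_alleles - 1, n_alleles - 1) - 1
```

For Z=3 and K=3, colex_genotypes(3, 3) lists the ten count vectors in the same order as the table above, and n_stored_probabilities(3, 3, False) gives 9.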

Representation of probabilities

For both genotype and haplotype data, each probability value is stored using B bits as follows. An integer of length B bits can represent the values 0, ..., 2^B-1 inclusive. To interpret a stored value x as a probability:

  1. Convert x to an integer in floating-point representation.
  2. Divide by 2^B-1.
Thus, probabilities stored in Layout 2 blocks take possible values of the form x/(2^B-1) ∈ [0,1].
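
As a small illustration, the conversion and the inference of the final unstored probability could look like this. The function names are hypothetical, and the sketch assumes the B-bit integers have already been extracted from the packed data.

```python
def stored_to_probability(x: int, bits: int) -> float:
    """Interpret a B-bit stored value x as a probability in [0, 1]."""
    max_value = (1 << bits) - 1  # 2^B - 1
    if not 0 <= x <= max_value:
        raise ValueError("value does not fit in the given number of bits")
    return x / max_value

def complete_probabilities(stored: list, bits: int) -> list:
    """Convert stored values and append the inferred last probability."""
    probs = [stored_to_probability(x, bits) for x in stored]
    # The final probability is one minus the sum of the stored ones.
    probs.append(1.0 - sum(probs))
    return probs
```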

Storing probabilities to the limited precision afforded by B bits requires a rounding rule, which we specify as follows. Given a vector v=(v_1, ..., v_d) of d probabilities that sum to one, we round by finding the closest point to v of the form x/(2^B-1) where the entries of x are nonnegative integers summing to (2^B-1). The integer vector x can be found by the following algorithm:

  1. Multiply v by 2^B-1.
  2. Compute the total fractional part F = ∑_i (v_i - floor(v_i)).
  3. Form x by rounding the F entries of v with the largest fractional parts up to the nearest integer, and the other d-F entries down to the nearest smaller integer.
The results of Bomze et al, 2014 imply that x/(2^B-1) is the nearest point to v that can be stored in the BGEN format with B bits.

The maximum error in a probability stored using this rounding rule is 1/(2^B-1).

In practice there may be some rounding error in probabilities input into the BGEN format. We therefore renormalise input probabilities to sum to one.
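
The rounding rule, including the renormalisation step, can be sketched in Python as follows. The function name round_probabilities is hypothetical. As an implementation choice of this sketch, F is computed as (2^B-1) minus the sum of the floored entries, which equals the total fractional part when the scaled vector sums to 2^B-1 and avoids accumulating floating-point error. Ties between equal fractional parts are broken by position, which is also an assumption.

```python
import math

def round_probabilities(v: list, bits: int) -> list:
    """Round probabilities v to integers summing to 2^B - 1."""
    scale = (1 << bits) - 1
    total = sum(v)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    # Renormalise to sum to one, then multiply by 2^B - 1.
    scaled = [p / total * scale for p in v]
    floors = [math.floor(s) for s in scaled]
    fractions = [s - f for s, f in zip(scaled, floors)]
    # Number of entries to round up (the total fractional part F).
    f_total = scale - sum(floors)
    if not 0 <= f_total <= len(v):
        raise ValueError("unexpected floating-point error while rounding")
    # Indices ordered by decreasing fractional part.
    order = sorted(range(len(v)), key=lambda i: fractions[i], reverse=True)
    x = list(floors)
    for i in order[:f_total]:
        x[i] += 1
    return x
```

For example, with B=2 the vector (0.25, 0.75) scales to (0.75, 2.25), so F=1 and the first entry is rounded up, giving x=(1, 2).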