The BGEN format
A compressed binary format for typed and imputed genotype data

This page documents v1.0 of the BGEN format, which is now deprecated. This format should not be used for new files. See here for the most recent version of the BGEN format.

Note: The data layout documented here has the limitation that alleles can be only 1 character long. This prevents its use with (for example) the latest 1000 genomes data release, which contains structural variants many kilobases long. Version 1.1 was developed to be backwards-compatible with this format, and addresses this issue.

SNP block format

Each SNP block in a BGEN v1.0 file consists of the following data, in order.

| No. of bytes | Description |
| --- | --- |
| 4 | The number of individuals the row represents, hereafter denoted N. This is an integer encoded in four bytes. |
| 1 | An unsigned integer S, indicating the length of the storage used for each of the SNPID and RSID fields in the row. |
| 1 | The length, SNPID_size, of (the data part of) the SNPID string. This must be between 0 and S. |
| S | The SNPID of the row. Only the first SNPID_size bytes will be used. |
| 1 | The length, RSID_size, of (the data part of) the RSID string. This must be between 0 and S. |
| S | The RSID of the row. Only the first RSID_size bytes will be used. |
| 1 | The chromosome on which the SNP is found, encoded as an unsigned 8-bit integer. The encoding is: |
| | 1-22: SNP lies in the chromosome with the given number. |
| | 23: SNP lies in the non pseudo-autosomal part of the X chromosome. |
| | 24: SNP lies in the non pseudo-autosomal part of the Y chromosome. |
| | 253: SNP lies in the pseudo-autosomal region of the X/Y chromosomes. |
| | 254: SNP lies in the mitochondrial DNA. |
| | 255: Indicates the chromosome is unknown. We advise that this only be used for test data. |
| 4 | The SNP position, encoded as an unsigned 32-bit integer. |
| 1 | The length LA of the A allele, encoded as an unsigned 8-bit integer. |
| LA | The A allele. |
| 1 | The length LB of the B allele, encoded as an unsigned 8-bit integer. |
| LB | The B allele. |
| P | Genotype probability data for the SNP for each of the N individuals in the cohort. If the CompressedSNPBlocks flag is not set, this field consists of P = 6*N bytes representing the probabilities. If CompressedSNPBlocks is set, this field contains a 32-bit unsigned integer specifying the length of the compressed data, followed by the compressed data itself. See below for more details. |
| 14 + 2*S + LA + LB + P | TOTAL |
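
As an illustration of the layout above, the following is a minimal Python sketch that reads the fields of one SNP block up to, but not including, the probability data. The names `SnpBlockHeader` and `parse_snp_block_header` are hypothetical and are not part of any BGEN library. The sketch assumes that N is unsigned and that the whole block is already held in memory as a bytes object.

```python
import struct
from typing import NamedTuple


class SnpBlockHeader(NamedTuple):
    """Fields of one SNP block that precede the probability data (illustrative only)."""
    n_individuals: int
    snp_id: bytes
    rsid: bytes
    chromosome: int
    position: int
    allele_a: bytes
    allele_b: bytes
    data_offset: int  # offset in the buffer at which the probability data starts


def parse_snp_block_header(buf: bytes, offset: int = 0) -> SnpBlockHeader:
    # N (4 bytes, assumed unsigned) and S (1 byte), both little-endian.
    n, s = struct.unpack_from("<IB", buf, offset)
    offset += 5

    def read_padded(off):
        # One length byte, then S bytes of storage; only `size` bytes are data.
        size = buf[off]
        if size > s:
            raise ValueError("string length exceeds the storage size S")
        return buf[off + 1 : off + 1 + size], off + 1 + s

    snp_id, offset = read_padded(offset)
    rsid, offset = read_padded(offset)

    # Chromosome code (1 byte) followed by the position (4 bytes).
    chromosome, position = struct.unpack_from("<BI", buf, offset)
    offset += 5

    def read_allele(off):
        # One length byte followed by exactly that many allele bytes.
        length = buf[off]
        return buf[off + 1 : off + 1 + length], off + 1 + length

    allele_a, offset = read_allele(offset)
    allele_b, offset = read_allele(offset)
    return SnpBlockHeader(n, snp_id, rsid, chromosome, position,
                          allele_a, allele_b, offset)


# Usage: build a small block by hand (S = 10, made-up identifiers) and parse it.
block = (
    struct.pack("<IB", 2, 10)
    + bytes([4]) + b"snp1".ljust(10, b"\0")
    + bytes([3]) + b"rs1".ljust(10, b"\0")
    + struct.pack("<BI", 1, 12345)
    + bytes([1]) + b"A"
    + bytes([1]) + b"G"
)
header = parse_snp_block_header(block)
```

In this hand-built example `header.data_offset` equals `len(block)`, because no probability data has been appended.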

SNP block probability data

The probability data is listed as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of AA, the second the probability of AB, and the third the probability of BB. Altogether these occupy 6*N bytes, where N is the number of samples. If the CompressedSNPBlocks flag is not set in the header, these 6*N bytes are written directly. If the flag is set, the 6*N bytes are first compressed using zlib, and the SNP block then contains a 4-byte integer giving the length of the compressed data, followed by the compressed data itself.
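
The two storage variants described above can be sketched in Python as follows. The function name `read_probability_integers` and its parameters are hypothetical. The sketch assumes that the compressed data is a standard zlib stream, as produced by zlib's `compress` function, and that the caller already knows N and the state of the CompressedSNPBlocks flag.

```python
import struct
import zlib


def read_probability_integers(buf, offset, n, compressed):
    """Return (list of raw (AA, AB, BB) integer triples, offset after the field)."""
    if compressed:
        # 4-byte little-endian length, then that many bytes of zlib data.
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        raw = zlib.decompress(buf[offset : offset + length])
        offset += length
    else:
        # Uncompressed: the 6*N bytes are stored directly.
        raw = buf[offset : offset + 6 * n]
        offset += 6 * n
    if len(raw) != 6 * n:
        raise ValueError("expected 6*N bytes of probability data")
    # 3*N little-endian 2-byte integers, grouped into one triple per individual.
    values = struct.unpack("<%dH" % (3 * n), raw)
    triples = [values[i : i + 3] for i in range(0, 3 * n, 3)]
    return triples, offset


# Usage: the same two individuals stored uncompressed and compressed.
raw = struct.pack("<6H", 10000, 0, 0, 0, 5000, 5000)
plain_triples, plain_end = read_probability_integers(raw, 0, 2, compressed=False)

packed = zlib.compress(raw)
field = struct.pack("<I", len(packed)) + packed
zlib_triples, zlib_end = read_probability_integers(field, 0, 2, compressed=True)
```

Both calls return the same raw integer triples. These integers still have to be converted to probabilities.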

To convert the stored 2-byte integers into probabilities, the following calculation should be performed:

  1. Convert the number into a floating-point format (e.g. float or double).
  2. Divide by 10,000.

Note that the range of a two-byte unsigned integer is 0 - 65535 inclusive. Thus the resulting probabilities can take on values between 0 and 6.5535 inclusive and are accurate to four decimal places.

Note: to convert a floating-point number to the stored 2-byte format, do the following:

  1. Check the number lies in the half-open interval [ 0, 6.55355 ).
  2. Multiply by 10000 and round to the nearest integer.
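
The conversions in both directions can be sketched in Python as follows. The function names `to_stored` and `to_probability` are hypothetical. The sketch rounds halves upwards, which is one possible reading of "round to the nearest integer", and clamps the result to 65535 as a guard against floating-point error at the very top of the interval.

```python
SCALE = 10000
MAX_STORED = 65535  # largest value of a 2-byte unsigned integer


def to_stored(p: float) -> int:
    """Convert a floating-point number to the stored 2-byte integer."""
    # Step 1: check the number lies in the half-open interval [0, 6.55355).
    if not (0.0 <= p < 6.55355):
        raise ValueError("value outside the representable range")
    # Step 2: multiply by 10000 and round to the nearest integer.
    # The clamp is a defensive assumption, not something the format requires.
    return min(int(p * SCALE + 0.5), MAX_STORED)


def to_probability(stored: int) -> float:
    """Convert a stored 2-byte integer back to a floating-point number."""
    return float(stored) / SCALE
```

For example, `to_stored(1.0)` gives 10000 and `to_probability(2500)` gives 0.25.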

All numbers are stored in little-endian (least significant byte first) order.