Overview
A BGEN file consists of a header block, giving general information about the file, and an optional sample identifier block. These are followed by a series of variant data blocks, stored consecutively in the file, each of which contains data for a single genetic variant. To allow for potential future additions to the spec, the first variant data block is located using an offset stored in the first four bytes of the file.
The format in which variant data blocks are stored is determined by a set of flag bits stored in the header block. Currently two formats are supported: Layout 1 blocks, which are a direct translation to binary of the GEN format; and Layout 2 blocks, which are both more space-efficient and more flexible, including support for genotype and haplotype data, multi-allelic variants, and non-diploid samples.
Data types
All numbers in a BGEN file are stored as unsigned integers in little-endian (least significant byte first) order. This choice coincides with the memory layout used on most common architectures; see the Wikipedia page on endianness for more details.
Variant identifiers, chromosome identifiers, and other string fields are stored as a two- or four-byte integer length followed by the data itself (which does not include a C-style trailing zero byte).
Genotype probabilities are stored in an efficient packed bit representation described in detail below.
Finally, some fields in BGEN are interpreted as flags encoded as a bitmask.
The first four bytes
The first four bytes of the file encode an unsigned integer indicating the offset, relative to the 5th byte of the file, of the start of the first variant data block, or the end of the file if there are 0 variant data blocks. For example, if this offset is 20 (the minimum possible because the header block always has size at least 20) then the variant data blocks start at byte 25.
No. of bytes  Description 

4  An unsigned integer offset indicating the offset, relative to the fifth byte of the file, of the first byte of the first variant data block (or the end of the file if there are no variant data blocks). 
4  TOTAL 
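As an illustration, the offset arithmetic above can be sketched in Python. The function name `first_variant_block_position` is hypothetical and is not part of the spec; the sketch only assumes the first four bytes of the file are available in memory.

```python
import struct

def first_variant_block_position(buf: bytes) -> int:
    """Return the 0-based byte position of the first variant data block."""
    # The first four bytes hold a little-endian unsigned 32-bit offset.
    (offset,) = struct.unpack_from("<I", buf, 0)
    # The offset is counted from the fifth byte of the file (0-based position 4).
    return 4 + offset

# With the minimum offset of 20, the blocks start at 0-based position 24,
# i.e. at the 25th byte of the file.
position = first_variant_block_position(struct.pack("<I", 20))
```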
The header block
The header block contains global information about the file, including the number of samples and the number of variant data blocks the file contains, and flags indicating how data is stored.
No. of bytes  Description 

4  An unsigned integer L_{H} indicating the length, in bytes, of the header block. This must not be larger than offset. 
4  An unsigned integer M indicating the number of variant data blocks stored in the file. 
4  An unsigned integer N indicating the number of samples represented in the variant data blocks in the file. 
4  'Magic number' bytes. This field should contain the four bytes 'b', 'g', 'e', 'n'. For backwards compatibility, readers should also accept the value 0 (four zero bytes) here. 
L_{H}-20  Free data area. This could be used to store, for example, identifying information about the file. 
4  A set of flags, with bits numbered as for an unsigned integer. See below for flag definitions. 
L_{H}  TOTAL 
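A reader for the header block might look like the following minimal sketch. The `Header` type and `parse_header` function are hypothetical names introduced for illustration, and the sketch assumes the whole file prefix is held in a `bytes` object with the header starting at byte position 4.

```python
import struct
from typing import NamedTuple

class Header(NamedTuple):
    length: int       # L_H
    n_variants: int   # M
    n_samples: int    # N
    free_data: bytes  # L_H - 20 bytes
    flags: int        # raw 32-bit flags field

def parse_header(buf: bytes, pos: int = 4) -> Header:
    """Parse the header block that starts at byte position pos."""
    length, n_variants, n_samples = struct.unpack_from("<III", buf, pos)
    if length < 20:
        raise ValueError("header block must be at least 20 bytes long")
    magic = buf[pos + 12 : pos + 16]
    # Four zero bytes are accepted for backwards compatibility.
    if magic not in (b"bgen", b"\x00\x00\x00\x00"):
        raise ValueError("unexpected magic number: %r" % magic)
    # The flags occupy the last four bytes of the header block;
    # everything between the magic number and the flags is free data.
    free_data = buf[pos + 16 : pos + length - 4]
    (flags,) = struct.unpack_from("<I", buf, pos + length - 4)
    return Header(length, n_variants, n_samples, free_data, flags)
```

A caller would typically also check that `length` does not exceed the offset read from the first four bytes.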
Header block  flag definitions
The following flags can be contained in the flags field in the header block. Note: bits and field values not specified here are reserved for possible future use.
Bit  Name  Value  Description 

0-1  CompressedSNPBlocks  0  Indicates SNP block probability data is not compressed. 
0-1  CompressedSNPBlocks  1  Indicates SNP block probability data is compressed using zlib's compress() function. 
0-1  CompressedSNPBlocks  2  Indicates SNP block probability data is compressed using zstandard's ZSTD_compress() function. 
2-5  Layout (previously called LongIds)  0  This value is not supported. 
2-5  Layout (previously called LongIds)  1  Indicates SNP blocks are laid out according to Layout 1, i.e. as in the v1.1 spec. This allows for multiple characters in alleles and is supported in SNPTEST from version 2.3.0, and in QCTOOL from version 1.1. 
2-5  Layout (previously called LongIds)  2  Indicates SNP blocks are laid out according to Layout 2, introduced in version 1.2 of the spec. This format supports multiple alleles, phased and unphased genotypes, explicit specification of ploidy and missing data, and configurable levels of compression. It is recommended that all new files are stored with Layout=2. 
2-5  Layout (previously called LongIds)  >2  Values > 2 are reserved for future use. 
31  SampleIdentifiers  0  Indicates sample identifiers are not stored in this file. 
31  SampleIdentifiers  1  Indicates a sample identifier block follows the header. It is recommended that all new files are created with SampleIdentifiers=1. 
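The flag fields can be extracted with shifts and masks. The sketch below assumes CompressedSNPBlocks occupies bits 0-1, Layout bits 2-5, and SampleIdentifiers bit 31, with bit 0 being the least significant bit of the 32-bit flags value; `decode_flags` is a hypothetical helper name.

```python
def decode_flags(flags: int) -> dict:
    """Split the 32-bit header flags field into its named components."""
    return {
        "CompressedSNPBlocks": flags & 0x3,       # bits 0-1
        "Layout": (flags >> 2) & 0xF,             # bits 2-5
        "SampleIdentifiers": (flags >> 31) & 0x1, # bit 31
    }

# Example: zlib compression, Layout 2, sample identifiers present.
example = decode_flags(0x80000009)
```

A reader should reject files whose decoded values fall outside the ranges defined in the table, since those values are reserved.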
Sample identifier block
If SampleIdentifiers=1 in the flags field, the header block is immediately followed by a sample identifier block. This stores a single identifier per sample.
Note: BGEN treats sample identifiers as strings of bytes, and does not impose any additional restrictions. However, for the simplest interoperability with other software (e.g. R's make.names) it is often sensible to restrict identifiers to ASCII alphanumeric characters, underscores, and full stops.
No. of bytes  Description 

4  An unsigned integer L_{SI} indicating the length in bytes of the sample identifier block. This must satisfy the constraint L_{SI}+L_{H} ≤ offset. 
4  An unsigned integer N indicating the number of samples represented in the file. This must be the same as the number N in the header block. 
2  An unsigned integer indicating the length L_{s1} of the identifier of sample 1. 
L_{s1}  Identifier of sample 1. 
2  An unsigned integer indicating the length L_{s2} of the identifier of sample 2. 
L_{s2}  Identifier of sample 2. 
...  
2  An unsigned integer indicating the length L_{sN} of the identifier of sample N. 
L_{sN}  Identifier of sample N. 
L_{SI} = 8 + 2×N + ∑_{n}L_{sn}  TOTAL 
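The table above translates into a simple loop over length-prefixed identifiers. This is a minimal sketch under the assumption that the block is held in memory; `parse_sample_block` is a hypothetical name, and identifiers are returned as raw bytes because the spec does not impose an encoding.

```python
import struct
from typing import List

def parse_sample_block(buf: bytes, pos: int) -> List[bytes]:
    """Parse a sample identifier block starting at byte position pos."""
    block_length, n_samples = struct.unpack_from("<II", buf, pos)
    cursor = pos + 8
    identifiers = []
    for _ in range(n_samples):
        # Each identifier is a 2-byte length followed by that many bytes.
        (id_length,) = struct.unpack_from("<H", buf, cursor)
        cursor += 2
        identifiers.append(buf[cursor : cursor + id_length])
        cursor += id_length
    # Consistency check: L_SI = 8 + 2*N + sum of identifier lengths.
    if cursor - pos != block_length:
        raise ValueError("sample identifier block length mismatch")
    return identifiers
```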
Variant data blocks
Following the header comes a sequence of M variant data blocks (where M is the number specified in the header block). This document describes SNP blocks for version 1.1 (specified by the Layout=1 flag) and 1.2 (Layout=2 flag) of the BGEN spec.
Variant data blocks are comprised of: a section of identifying data (containing variant IDs, position, and alleles), followed by a section containing the genotype probability data itself. Most files will have CompressedSNPBlocks=1, indicating that genotype probability data is stored compressed. (The variant identifying data is never compressed, however.)
Variant identifying data
No. of bytes  Description 

4  The number of individuals the row represents, hereafter denoted N. This is only present if Layout=1 (otherwise it appears instead in the genotype probability block below). 
2  The length L_{id} of the variant identifier. (The variant identifier is intended to store e.g. chip manufacturer IDs for assayed SNPs). 
L_{id}  The variant identifier. 
2  The length L_{rsid} of the rsid. 
L_{rsid}  The rsid. 
2  The length L_{chr} of the chromosome 
L_{chr}  The chromosome 
4  The variant position, encoded as an unsigned 32-bit integer. 
2  The number K of alleles, encoded as an unsigned 16-bit integer. If Layout=1, this field is omitted, and assumed to equal 2. 
4  The length L_{a1} of the first allele. 
L_{a1}  The first allele. 
4  The length L_{a2} of the second allele. 
L_{a2}  The second allele. 
...  ...(possibly more alleles)... 
4  The length L_{aK} of the Kth allele. 
L_{aK}  The Kth allele. 
16 + 4K + L_{id} + L_{rsid} + L_{chr} + ∑_{k}L_{ak} + D  TOTAL 
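For Layout=2, where the leading 4-byte sample count is absent and the allele count K is present, the fields above could be read as in the following sketch. The helper names `read_string` and `parse_variant_id_layout2` are hypothetical; strings are returned as raw bytes.

```python
import struct

def read_string(buf: bytes, pos: int, length_format: str):
    """Read a length-prefixed string; return (data, position after it)."""
    (length,) = struct.unpack_from(length_format, buf, pos)
    start = pos + struct.calcsize(length_format)
    return buf[start : start + length], start + length

def parse_variant_id_layout2(buf: bytes, pos: int):
    """Parse variant identifying data (Layout 2); return (fields, next position)."""
    variant_id, pos = read_string(buf, pos, "<H")  # 2-byte length prefix
    rsid, pos = read_string(buf, pos, "<H")
    chromosome, pos = read_string(buf, pos, "<H")
    # 4-byte position followed by the 2-byte allele count K.
    position, n_alleles = struct.unpack_from("<IH", buf, pos)
    pos += 6
    alleles = []
    for _ in range(n_alleles):
        allele, pos = read_string(buf, pos, "<I")  # 4-byte length prefix
        alleles.append(allele)
    fields = {
        "id": variant_id,
        "rsid": rsid,
        "chromosome": chromosome,
        "position": position,
        "alleles": alleles,
    }
    return fields, pos
```

For Layout=1 a reader would instead consume the 4-byte sample count first and assume K=2 without reading an allele count.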
Genotype data block (Layout 1)
Layout 1 blocks are used when Layout=1. Only two alleles (K=2) are supported. All samples are stored as if diploid; haploid samples should be stored as if having homozygous genotype. Missing samples are encoded as three zero probabilities. This is a direct translation to binary format of a GEN file.
No. of bytes  Description 

4  The total length C of the compressed genotype probability data for this variant. Seeking forward this many bytes takes you to the next variant data block. If CompressedSNPBlocks=0 this field is omitted and the length of the uncompressed data is C=6N. 
C  Genotype probability data for the SNP for each of the N individuals in the cohort in the format described below. If CompressedSNPBlocks=0 this consists of C=6N bytes in the format described below. Otherwise this is C bytes which can be uncompressed using zlib to form 6N bytes stored in the format described below. (Zstandard compression, encoded by the value CompressedSNPBlocks = 2, is not supported for Layout 1 blocks.) 
C or C+4  TOTAL 
Probability data storage
For Layout 1 blocks, probability data is stored as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of a homozygous 'AA' genotype, the second the probability of 'AB', the third the probability of 'BB', where A and B are the two alleles at the variant. When CompressedSNPBlocks is not set, these 6*N bytes are stored in the file directly. When CompressedSNPBlocks>0, these 6*N bytes are first compressed using zlib or zstandard, and the length of the compressed data is stored as the 4-byte integer C, followed by the compressed data itself.
To convert the stored 2-byte integers into probabilities, the following calculation should be performed:
 Convert the number into a floatingpoint format (e.g. float or double).
 Divide by 32,768.
Note that the range of a two-byte unsigned integer is 0 to 65,535 inclusive. Thus the resulting probabilities can take on values between 0 and 65,535/32,768 ~ 1.99997 inclusive and they are accurate to four decimal places.
To convert a floating point probability to its integer representation, do the following:
 Multiply by 32,768.
 Check that the number is in the half-open interval [0,65535.5) and round to the nearest integer.
All numbers are stored in little-endian (least significant byte first) order. Probabilities for samples with missing genotype data should be stored as zero.
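The two conversions described above can be sketched as follows. The function names are hypothetical, the input to `decode_layout1` is assumed to be the 6N uncompressed bytes, and ties in rounding are resolved upwards here, which is one possible reading of "round to the nearest integer".

```python
import math
import struct

def decode_layout1(data: bytes, n_samples: int):
    """Convert 6*N uncompressed bytes into (AA, AB, BB) probability triples."""
    # 3*N little-endian unsigned 16-bit integers.
    values = struct.unpack_from("<%dH" % (3 * n_samples), data, 0)
    return [
        tuple(v / 32768 for v in values[3 * i : 3 * i + 3])
        for i in range(n_samples)
    ]

def encode_probability(p: float) -> int:
    """Convert a floating-point probability to its 2-byte integer form."""
    scaled = p * 32768
    if not (0 <= scaled < 65535.5):
        raise ValueError("probability out of representable range")
    return math.floor(scaled + 0.5)  # nearest integer, ties rounded up
```

A sample with missing data decodes to the triple (0.0, 0.0, 0.0).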
Genotype data block (Layout 2)
Layout 2 blocks are used when Layout=2. This format supports arbitrary numbers of alleles (up to 65535), samples of arbitrary ploidy (up to 63), and both phased and unphased data.
No. of bytes  Description 

4  The total length C of the rest of the data for this variant. Seeking forward this many bytes takes you to the next variant data block. 
4  The total length D of the probability data after uncompression. If CompressedSNPBlocks = 0, this field is omitted and the total length of the probability data is D=C. 
C or C-4  Genotype probability data for the SNP for each of the N individuals in the cohort. If CompressedSNPBlocks = 0, this is D bytes stored in the format described below. If CompressedSNPBlocks is nonzero, this is C-4 bytes which can be uncompressed to form D bytes in the format described below. (Compression uses either zlib or zstd according to the value of CompressedSNPBlocks; see the header block documentation.) 
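The relationship between C, D, and the stored payload can be sketched as below. The function name is hypothetical. Only the uncompressed and zlib cases are handled, because the Python standard library used here has no Zstandard decoder; a real reader would plug one in for CompressedSNPBlocks = 2.

```python
import struct
import zlib

def read_layout2_probability_bytes(buf: bytes, pos: int, compression: int):
    """Return (uncompressed probability data, position of the next block)."""
    (c,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    end = pos + c  # seeking forward C bytes reaches the next variant block
    if compression == 0:
        # No D field: the C bytes are the probability data itself.
        return buf[pos:end], end
    (d,) = struct.unpack_from("<I", buf, pos)
    payload = buf[pos + 4 : end]  # C-4 compressed bytes
    if compression == 1:
        data = zlib.decompress(payload)
    else:
        # Assumption of this sketch: zstd support is left to the caller.
        raise NotImplementedError("zstd decompression not included in this sketch")
    if len(data) != d:
        raise ValueError("uncompressed length does not match D")
    return data, end
```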
Probability data storage
Layout 2 probability data storage is structured as described below. If CompressedSNPBlocks = 0 the structure is stored directly, and C reflects the length of this structure. If CompressedSNPBlocks > 0 the whole structure is stored after compression. In this case D reflects the length of the uncompressed structure and the length of the compressed structure is C-4.
No. of bytes  Description 

4  The number of individuals for which probability data is stored. This must equal N as defined in the header block. 
2  The number of alleles, encoded as an unsigned 16-bit integer. This must equal K as defined in the variant identifying data block. 
1  The minimum ploidy P_{min} of samples in the row. Values between 0 and 63 are allowed. 
1  The maximum ploidy P_{max} of samples in the row. Values between 0 and 63 are allowed. 
N  A list of N bytes, where the nth byte is an unsigned integer representing the ploidy and missingness of the nth sample. Ploidy (possible values 0-63) is encoded in the least significant 6 bits of this value. Missingness is encoded by the most significant bit; thus a value of 1 for the most significant bit indicates that no probability data is stored for this sample. (Note: there is no way to indicate that the ploidy itself is missing.) 
1  Flag, denoted Phased, indicating what is stored in the row. If Phased=1 the row stores one probability per allele (other than the last allele) per haplotype (e.g. to represent phased data). If Phased=0 the row stores one probability per possible genotype (other than the 'last' genotype where all alleles are the last allele), to represent unphased data. Any other value for Phased is an error. 
1  Unsigned integer B representing the number of bits used to store each probability in this row. This must be between 1 and 32 inclusive. 
X  Probabilities for each possible haplotype (if Phased=1) or genotype (if Phased=0) for the samples. Each probability is stored in B bits. Values are interpreted by linear interpolation between 0 and 1, i.e. value b corresponds to probability b / ( 2^{B}-1 ). When storing the value, probabilities should be rounded according to the algorithm described below. Probabilities are stored consecutively for samples 1, 2, ..., N. For each sample the order of stored probabilities is described below. Probabilities for samples with missing data (as defined by the missingness/ploidy byte) are written as zeroes (note this represents a change from the earlier draft of this spec; see the rationale below). 
D=10+N+∑_{i}P_{i}  TOTAL 
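The fixed-size fields that precede the packed probabilities could be parsed as in this sketch, which operates on the uncompressed structure. `parse_layout2_prefix` is a hypothetical name; the masks follow the ploidy/missingness byte description above.

```python
import struct

def parse_layout2_prefix(data: bytes):
    """Parse the fields before the packed probabilities.

    Returns (fields, offset), where offset is the position in data at
    which the packed probability values begin.
    """
    n, k, p_min, p_max = struct.unpack_from("<IHBB", data, 0)
    ploidy_bytes = data[8 : 8 + n]
    ploidy = [b & 0x3F for b in ploidy_bytes]         # least significant 6 bits
    missing = [bool(b & 0x80) for b in ploidy_bytes]  # most significant bit
    phased, bits = struct.unpack_from("<BB", data, 8 + n)
    if phased not in (0, 1):
        raise ValueError("Phased must be 0 or 1")
    if not 1 <= bits <= 32:
        raise ValueError("B must be between 1 and 32 inclusive")
    fields = {
        "n_samples": n,
        "n_alleles": k,
        "min_ploidy": p_min,
        "max_ploidy": p_max,
        "ploidy": ploidy,
        "missing": missing,
        "phased": bool(phased),
        "bits": bits,
    }
    return fields, 10 + n
```

A reader would additionally check that `n_samples` and `n_alleles` agree with N in the header block and K in the variant identifying data.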
Per-sample order of stored probabilities
Consider a sample with ploidy Z and a variant with K alleles.
 For phased data, probabilities are stored in the order of haplotypes and then alleles, i.e.:
P_{11}, P_{12}, ..., P_{1(K-1)}, P_{21}, ..., P_{2(K-1)}, ..., P_{Z1}, ..., P_{Z(K-1)}
where P_{ij} is the probability that haplotype i has allele j. For each haplotype i the probability of the Kth allele (P_{iK}) is not stored; instead it is inferred as one minus the sum of the other probabilities for that haplotype. Thus a total of Z(K-1) probabilities are stored.

 For unphased data, enumerate the possible genotypes as the set of K-vectors of non-negative integers (x_{1}, x_{2}, ..., x_{K}) summing to Z, where x_{i} represents the count of the ith allele in the genotype. Probabilities are stored in colex order of these vectors. The last probability (corresponding to the Kth allele homozygotes) is not stored; instead it is inferred as one minus the sum of the other probabilities. Thus a total of ( (Z+K-1) choose (K-1) ) - 1 probabilities are stored.
Example. If Z=3 and K=3 then the enumerated genotypes with allele count representations are:
Index  Genotype  Allele counts 

0  111  (3,0,0) 
1  112  (2,1,0) 
2  122  (1,2,0) 
3  222  (0,3,0) 
4  113  (2,0,1) 
5  123  (1,1,1) 
6  223  (0,2,1) 
7  133  (1,0,2) 
8  233  (0,1,2) 
9  333  (0,0,3) 
The stored probabilities are thus
P_{111}, P_{112}, P_{122}, P_{222}, P_{113}, P_{123}, P_{223}, P_{133}, P_{233}
with P_{333} inferred as one minus the sum of the other probabilities.
The colex order has the important property that, for each i, the genotypes carrying the ith allele appear later in the order than those that carry only alleles 1,...,i-1. See the rationale below for a further discussion of this choice of storage order.
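One way to generate the unphased storage order is to list each genotype as a non-decreasing tuple of allele indices and sort those tuples by their reversed form, which yields the colex order of the Z=3, K=3 example above. This is an illustrative sketch rather than a reference implementation, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def colex_genotypes(ploidy: int, n_alleles: int):
    """List genotypes (as tuples of 1-based allele indices) in storage order."""
    genotypes = combinations_with_replacement(range(1, n_alleles + 1), ploidy)
    # Colex order: compare the last allele first, then the one before it, etc.
    return sorted(genotypes, key=lambda g: g[::-1])

def allele_counts(genotype, n_alleles: int):
    """Convert a genotype tuple into its vector of allele counts."""
    return tuple(genotype.count(a) for a in range(1, n_alleles + 1))

def stored_probability_count(ploidy: int, n_alleles: int, phased: bool) -> int:
    """Number of probabilities stored for one sample."""
    if phased:
        return ploidy * (n_alleles - 1)
    return comb(ploidy + n_alleles - 1, n_alleles - 1) - 1

order = colex_genotypes(3, 3)
counts = [allele_counts(g, 3) for g in order]
```

The last entry of `order` is always the genotype homozygous for the Kth allele, whose probability is inferred rather than stored.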
Representation of probabilities
For both genotype and haplotype data, each probability value is stored using B bits as follows. An integer of length B bits can represent the values 0, ..., 2^{B}-1 inclusive. To interpret a stored value x as a probability:
 Convert x to an integer in floating-point representation.
 Divide by 2^{B}-1.
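A decoder for the packed values might look like the sketch below. Note the assumption, which this section does not state explicitly: the packed data is treated as a little-endian bit stream in which value i occupies bits i*B to (i+1)*B-1, counting from the least significant bit of the first byte. The function name is hypothetical.

```python
def unpack_probabilities(data: bytes, count: int, bits: int):
    """Decode count B-bit values from data into floating-point probabilities."""
    if len(data) * 8 < count * bits:
        raise ValueError("not enough data for the requested number of values")
    # Assumption: little-endian bit stream, first value in the lowest bits.
    stream = int.from_bytes(data, "little")
    max_value = (1 << bits) - 1  # 2^B - 1
    return [((stream >> (i * bits)) & max_value) / max_value for i in range(count)]
```

For B=8 each value is simply one byte, so the bytes 255, 0, 51 decode to the probabilities 1.0, 0.0 and 0.2.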
Storing probabilities to the limited precision afforded by B bits requires a rounding rule, which we specify as follows. Given a vector v=(v_{1}, ..., v_{d}) of d probabilities that sum to one, we round by finding the closest point to v of the form x/(2^{B}-1) where the entries of x are non-negative integers summing to (2^{B}-1). The integer vector x can be found by the following algorithm:
 Multiply v by 2^{B}-1.
 Compute the total fractional part F = ∑_{i} (v_{i} - floor(v_{i})).
 Form x by rounding the F entries of v with the largest fractional parts up to the nearest integer, and the other d-F entries down to the nearest smaller integer.
The maximum error in a probability stored using this rounding rule is 1/(2^{B}-1).
In practice there may be some rounding error in probabilities input into the BGEN format. We therefore renormalise input probabilities to sum to one.
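The rounding rule, including the renormalisation step, can be sketched as follows. The function name is hypothetical, and F is recovered with `round` because floating-point arithmetic may leave the sum of fractional parts slightly off an exact integer; that guard is a choice of this sketch, not something the spec prescribes.

```python
import math

def round_probabilities(v, bits: int):
    """Round probabilities v to integers that sum to 2^B - 1."""
    scale = (1 << bits) - 1
    total = sum(v)
    # Renormalise so the inputs sum to one, then multiply by 2^B - 1.
    scaled = [p / total * scale for p in v]
    floors = [math.floor(s) for s in scaled]
    fractions = [s - f for s, f in zip(scaled, floors)]
    # F is an integer in exact arithmetic; round() absorbs floating-point error.
    f_total = round(sum(fractions))
    # Indices ordered by decreasing fractional part.
    by_fraction = sorted(range(len(v)), key=lambda i: fractions[i], reverse=True)
    x = list(floors)
    for i in by_fraction[:f_total]:
        x[i] += 1  # round the F largest fractional parts up
    return x
```

For example, with B=2 the vector (0.2, 0.3, 0.5) scales to roughly (0.6, 0.9, 1.5), so F=2 and the two entries with the largest fractional parts are rounded up, giving (1, 1, 1).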