Gavin Band's webspace at the Wellcome Trust Centre for Human Genetics

v1.1

This page documents version 1.1 of the BGEN format. A more recent version of this specification is available; see here for details.

Detailed specification

A BGEN file consists of a header block followed by a series of blocks called SNP blocks. The first four bytes of the file indicate the start position of the first SNP block (relative to the fifth byte of the file).

Note: All numbers in the file are stored as integers in little-endian (least significant byte first) order. This choice coincides with the memory layout used on most common architectures. See the Wikipedia page for more details.

The first four bytes

The first four bytes of the file encode an unsigned integer indicating the offset, relative to the fifth byte of the file, of the start of the first SNP block (or of the end of the file if there are no SNP blocks). For example, if this offset is 20 (the minimum possible, because the header block always has size at least 20), then the SNP blocks start at the 25th byte of the file, i.e. at zero-based position 24.

No. of bytes	Description
4	An unsigned integer giving the offset, relative to the fifth byte of the file, of the first byte of the first SNP block (or of the end of the file if there are no SNP blocks).
4	TOTAL
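
As an illustration of the layout above, the offset can be read with Python's standard struct module. This is a minimal sketch rather than a reference implementation; the function names are made up for this example.

```python
import struct


def read_first_snp_offset(data: bytes) -> int:
    """Return the offset stored in the first four bytes of a BGEN file.

    The value is a little-endian unsigned 32-bit integer ("<I").
    """
    if len(data) < 4:
        raise ValueError("a BGEN file must contain at least 4 bytes")
    (offset,) = struct.unpack_from("<I", data, 0)
    return offset


def first_snp_block_position(data: bytes) -> int:
    """Zero-based file position of the first SNP block.

    The offset is relative to the fifth byte of the file, which sits at
    zero-based position 4.
    """
    return 4 + read_first_snp_offset(data)
```

With an offset of 20, first_snp_block_position gives 24, which is the 25th byte of the file when counting from one.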

The header block

The header block contains global information about the file.

No. of bytes	Description
4	An unsigned integer H indicating the length, in bytes, of the header block. H must not be larger than the offset stored in the first four bytes of the file.
4	An unsigned integer indicating the number of SNP blocks stored in the file.
4	An unsigned integer indicating the number of samples represented in the SNP blocks in the file.
4	Reserved. (Writers should write 0 here, readers should ignore these bytes.)
H-20	Free data area. This could be used to store, for example, identifying information about the file.
4	A set of flags, with bits numbered as for an unsigned integer. See below for flag definitions.
H	TOTAL
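
The header fields could be unpacked along the following lines. This is a sketch under two assumptions: the header block starts at the fifth byte of the file (zero-based position 4), and H counts the whole header block, so that the flags occupy its last four bytes. The BgenHeader type and parse_header function are hypothetical names used only for this example.

```python
import struct
from typing import NamedTuple


class BgenHeader(NamedTuple):
    header_length: int   # H
    num_snp_blocks: int
    num_samples: int
    free_data: bytes     # the H-20 bytes of the free data area
    flags: int


def parse_header(data: bytes, start: int = 4) -> BgenHeader:
    """Parse the header block beginning at zero-based position `start`."""
    # Four little-endian unsigned 32-bit integers: H, number of SNP blocks,
    # number of samples, and the reserved field (ignored by readers).
    h, num_snp_blocks, num_samples, _reserved = struct.unpack_from(
        "<4I", data, start
    )
    if h < 20:
        raise ValueError("header length H must be at least 20")
    if len(data) < start + h:
        raise ValueError("data is shorter than the declared header length")
    free_data = data[start + 16 : start + h - 4]
    (flags,) = struct.unpack_from("<I", data, start + h - 4)
    return BgenHeader(h, num_snp_blocks, num_samples, free_data, flags)
```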

Header block -- flag definitions

The following flags can be contained in the flags field in the header block. Note: all bits not listed here must be set to 0.

Bit	Name	Value	Description
0	CompressedSNPBlocks	0	Indicates SNP block probability data is not compressed.
		1	Indicates SNP block probability data is compressed using zlib's compress() function.
2	LongIds	0	Indicates alleles are stored as single characters. SNP blocks are laid out according to the v1.0 spec.
		1	Indicates version 1.1 of the SNP block layout is used. This allows for multiple characters in alleles and is supported in SNPTEST from version 2.3.0, and in QCTOOL version 1.1.
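
The flag table can be turned into bit masks as sketched below. The sketch assumes that bit 0 is the least significant bit of the flags integer, which is the usual reading of "numbered as for an unsigned integer". The constant and function names are illustrative only.

```python
# Bit masks for the two flags defined in the table above.
COMPRESSED_SNP_BLOCKS = 1 << 0  # bit 0
LONG_IDS = 1 << 2               # bit 2


def decode_flags(flags: int) -> dict:
    """Split the header flags field into named booleans.

    Raises ValueError if any bit not listed in the table is set, since
    all other bits must be 0.
    """
    unknown = flags & ~(COMPRESSED_SNP_BLOCKS | LONG_IDS)
    if unknown:
        raise ValueError(f"unexpected flag bits set: {unknown:#x}")
    return {
        "CompressedSNPBlocks": bool(flags & COMPRESSED_SNP_BLOCKS),
        "LongIds": bool(flags & LONG_IDS),
    }
```

For example, a flags value of 5 (binary 101) has both CompressedSNPBlocks and LongIds set.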

SNP blocks

Following the header comes a sequence of 0 or more SNP blocks. Each SNP block consists of the following data in order. (Note: the following description is valid when LongIds=1. When LongIds=0, SNP blocks are laid out as per the v1.0 spec, described here.)

No. of bytes	Description
4	The number of individuals the row represents, hereafter denoted N.
2	The length LS of the SNP id.
LS	The SNP id.
2	The length LR of the rsid.
LR	The rsid.
2	The length LC of the chromosome.
LC	The chromosome.
4	The SNP position, encoded as an unsigned 32-bit integer.
4	The length LA of the A allele.
LA	The A allele.
4	The length LB of the B allele.
LB	The B allele.
P	Genotype probability data for the SNP for each of the N individuals in the cohort. If the CompressedSNPBlocks flag is not set, this field consists of P=6N bytes representing the probabilities. If CompressedSNPBlocks is set, this field contains a 32-bit unsigned integer specifying the length of the compressed data, followed by the compressed data itself. See below for details of the storage scheme used.
22 + LS + LR + LC + LA + LB + P	TOTAL
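
The identifying fields of a SNP block (everything before the probability data) could be read as in the sketch below, which applies only when LongIds=1. The helper names are hypothetical. Because this page does not state a text encoding for the identifiers and alleles, the sketch returns them as raw bytes.

```python
import struct


def _read_field(data: bytes, pos: int, length_format: str):
    """Read a length-prefixed field; return (field bytes, new position)."""
    (length,) = struct.unpack_from(length_format, data, pos)
    pos += struct.calcsize(length_format)
    return data[pos : pos + length], pos + length


def parse_snp_block_ids(data: bytes, pos: int = 0):
    """Parse the fields preceding the probability data (LongIds=1 layout).

    Returns a dict of fields and the position where the probability
    data begins.
    """
    (n,) = struct.unpack_from("<I", data, pos)
    pos += 4
    snp_id, pos = _read_field(data, pos, "<H")      # 2-byte length LS
    rsid, pos = _read_field(data, pos, "<H")        # 2-byte length LR
    chromosome, pos = _read_field(data, pos, "<H")  # 2-byte length LC
    (position,) = struct.unpack_from("<I", data, pos)
    pos += 4
    allele_a, pos = _read_field(data, pos, "<I")    # 4-byte length LA
    allele_b, pos = _read_field(data, pos, "<I")    # 4-byte length LB
    fields = {
        "num_individuals": n,
        "snp_id": snp_id,
        "rsid": rsid,
        "chromosome": chromosome,
        "position": position,
        "allele_a": allele_a,
        "allele_b": allele_b,
    }
    return fields, pos
```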

SNP block probability data

The probability data is listed as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of AA, the second the probability of AB, the third the probability of BB. Altogether these occupy 6*N bytes where N is the number of samples. When CompressedSNPBlocks is not set, these 6 * N bytes are stored directly. When CompressedSNPBlocks is set, these 6 * N bytes are first compressed using zlib, and the length of the compressed data is stored as a 4-byte integer, followed by the compressed data itself.
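
The two storage cases can be handled as in the following sketch, which uses Python's standard zlib module (its decompress function reads the stream format produced by zlib's compress()). The function name and its parameters are assumptions made for this example.

```python
import struct
import zlib


def read_probability_ints(data: bytes, pos: int, n: int, compressed: bool):
    """Read the stored 2-byte integers for n samples starting at `pos`.

    Returns a list of n (AA, AB, BB) triples of raw integers and the
    position just past the probability data.
    """
    if compressed:
        # 4-byte little-endian length, then the zlib-compressed bytes.
        (compressed_length,) = struct.unpack_from("<I", data, pos)
        pos += 4
        raw = zlib.decompress(data[pos : pos + compressed_length])
        pos += compressed_length
    else:
        raw = data[pos : pos + 6 * n]
        pos += 6 * n
    if len(raw) != 6 * n:
        raise ValueError("probability data does not contain 6*N bytes")
    # 3*N little-endian unsigned 16-bit integers, grouped in threes.
    values = struct.unpack("<%dH" % (3 * n), raw)
    triples = [values[i : i + 3] for i in range(0, 3 * n, 3)]
    return triples, pos
```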

To convert the stored 2-byte integers into probabilities, the following calculation should be performed:

Convert the number into a floating-point format (e.g. float or double).
Divide by 32,768.

Note that the range of a two-byte unsigned integer is 0 to 65,535 inclusive. Thus the resulting probabilities can take values between 0 and 65,535/32,768 ≈ 1.99997 inclusive, in steps of 1/32,768 ≈ 0.00003, so they are accurate to four decimal places.
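
The conversion from stored integer to probability amounts to a single division, as in this small sketch (the function name is illustrative):

```python
def to_probability(stored: int) -> float:
    """Convert a stored 2-byte unsigned integer to a probability."""
    if not 0 <= stored <= 0xFFFF:
        raise ValueError("stored value must fit in an unsigned 16-bit integer")
    return float(stored) / 32768.0
```

Under this scheme a stored value of 32768 represents a probability of exactly 1, and 16384 represents 0.5.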

Note: to convert a floating point probability to its integer representation, do the following:

Multiply by 32,768.
Check that the number is in the half-open interval [0,65535.5) and round to the nearest integer.
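
The reverse conversion might look like the sketch below. The text above does not say how exact halves should be rounded; this example rounds them upwards via floor(x + 0.5), which is an assumption. Python's built-in round() would instead round halves to the nearest even integer.

```python
import math


def to_stored_integer(probability: float) -> int:
    """Convert a floating-point probability to its 2-byte integer form."""
    scaled = probability * 32768.0
    # The scaled value must lie in the half-open interval [0, 65535.5).
    if not 0.0 <= scaled < 65535.5:
        raise ValueError("probability is outside the representable range")
    # Round to nearest; ties go upwards (an assumption, see the lead-in).
    return int(math.floor(scaled + 0.5))
```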

All numbers are stored in little-endian (least significant byte first) order.