This page documents v1.0 of the BGEN format, which is now deprecated. This format should not be used for new files. See here for the most recent version of the BGEN format.
Note: The data layout documented here has the limitation that alleles can be only 1 character long. This prevents its use with (for example) the latest 1000 genomes data release, which contains structural variants many kilobases long. Version 1.1 was developed to be backwards-compatible with this format, and addresses this issue.
Each SNP block in a BGEN v1.0 file consists of the following data, in order.
|No. of bytes||Description|
|4||The number of individuals the row represents, hereafter denoted N. This is an integer encoded in four bytes.|
|1||An unsigned integer S, indicating the length of the storage used for the SNPID and RSID fields in the row.|
|1||The length, SNPID_size, of (the data part of) the SNPID string. This must be between 0 and S.|
|S||The SNPID of the row. Only the first SNPID_size bytes will be used.|
|1||The length, RSID_size, of (the data part of) the RSID string. This must be between 0 and S.|
|S||The RSID of the row. Only the first RSID_size bytes will be used.|
|1||The chromosome on which the SNP is found, encoded as an unsigned 8-bit integer. The encoding is:|
|4||The SNP position, encoded as an unsigned 32-bit integer.|
|1||The length LA of the A allele, encoded as an unsigned 8-bit integer.|
|LA||The A allele.|
|1||The length LB of the B allele, encoded as an unsigned 8-bit integer.|
|LB||The B allele.|
|P||Genotype probability data for the SNP for each of the N individuals in the cohort. If the CompressedSNPBlocks flag is not set, this field consists of P=6*N bytes representing the probabilities. If CompressedSNPBlocks is set, this field contains a 32-bit unsigned integer specifying the length of the compressed data, followed by the compressed data itself. See below for more details.|
|13 + 2*S + P||TOTAL|
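The field order in the table can be sketched as a small reader. This is an illustrative sketch rather than a reference implementation: the function name `read_snp_block_fields`, the returned dictionary keys, and the zero-byte padding in the example data are all hypothetical, and N is assumed to be unsigned. It stops before the genotype probability data.

```python
import io
import struct


def read_snp_block_fields(stream):
    """Read the identifying fields of one SNP block, stopping before the probabilities."""

    def read_exact(count):
        data = stream.read(count)
        if len(data) != count:
            raise EOFError("SNP block is truncated")
        return data

    def read_uint(fmt):
        # "<" selects little-endian; "B" is 1 byte and "I" is 4 bytes, both unsigned.
        return struct.unpack(fmt, read_exact(struct.calcsize(fmt)))[0]

    n = read_uint("<I")                     # number of individuals N (assumed unsigned)
    s = read_uint("<B")                     # storage length S for SNPID and RSID
    snpid_size = read_uint("<B")
    snpid = read_exact(s)[:snpid_size]      # only the first SNPID_size bytes are used
    rsid_size = read_uint("<B")
    rsid = read_exact(s)[:rsid_size]        # only the first RSID_size bytes are used
    chromosome = read_uint("<B")
    position = read_uint("<I")
    allele_a = read_exact(read_uint("<B"))  # length LA, then the A allele
    allele_b = read_exact(read_uint("<B"))  # length LB, then the B allele
    return {
        "n": n, "snpid": snpid, "rsid": rsid, "chromosome": chromosome,
        "position": position, "allele_a": allele_a, "allele_b": allele_b,
    }


# Hypothetical block for illustration: N=2, S=8, identifiers padded with zero bytes.
example = (
    struct.pack("<IBB", 2, 8, 4) + b"snp1".ljust(8, b"\x00")
    + struct.pack("<B", 3) + b"rs1".ljust(8, b"\x00")
    + struct.pack("<BI", 1, 1000)
    + struct.pack("<B", 1) + b"A"
    + struct.pack("<B", 1) + b"G"
)
fields = read_snp_block_fields(io.BytesIO(example))
```

The identifiers and alleles are returned as raw bytes, since this page does not state a text encoding for them.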
The probability data is stored as a sequence of 2-byte unsigned integers. These should be interpreted in triples: the first member is the probability of AA, the second the probability of AB, and the third the probability of BB. Altogether these occupy 6*N bytes, where N is the number of samples. If the CompressedSNPBlocks flag is not set in the header, these 6*N bytes are written directly. If the flag is set, the 6*N bytes are first compressed using zlib, and the SNP block then contains a 4-byte integer giving the length of the compressed data, followed by the compressed data itself.
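Both storage variants can be handled by one reader, sketched below with Python's standard `struct` and `zlib` modules. The function name `read_probability_triples` and its `compressed` argument (standing in for the CompressedSNPBlocks flag) are hypothetical. The stored values are read as unsigned, following the 0 to 65535 range this page gives for them.

```python
import io
import struct
import zlib


def read_probability_triples(stream, n, compressed):
    """Return one (AA, AB, BB) triple of stored integers per individual."""
    if compressed:
        # A 4-byte little-endian length, then that many bytes of zlib data.
        (length,) = struct.unpack("<I", stream.read(4))
        raw = zlib.decompress(stream.read(length))
    else:
        raw = stream.read(6 * n)
    if len(raw) != 6 * n:
        raise ValueError("expected %d bytes of probability data, got %d" % (6 * n, len(raw)))
    values = struct.unpack("<%dH" % (3 * n), raw)  # 3*N unsigned 2-byte integers
    return [values[i:i + 3] for i in range(0, 3 * n, 3)]


# Hypothetical data for two individuals, stored both ways.
raw = struct.pack("<6H", 10000, 0, 0, 2500, 5000, 2500)
packed = zlib.compress(raw)
from_plain = read_probability_triples(io.BytesIO(raw), 2, False)
from_zlib = read_probability_triples(
    io.BytesIO(struct.pack("<I", len(packed)) + packed), 2, True
)
```

Checking the decompressed length against 6*N catches a block whose compressed payload does not match the sample count.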
To convert the stored 2-byte integers into probabilities, divide each stored integer by 10000.
Note that the range of a two-byte unsigned integer is 0 - 65535 inclusive. Thus the resulting probabilities can take on values between 0 and 6.5535 inclusive and are accurate to four decimal places.
Note: to convert a floating-point number to this format, multiply it by 10000 and round the result to the nearest integer.
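A minimal sketch of both conversions follows. It assumes a scale factor of 10000, inferred from the stated mapping of stored values 0 to 65535 onto probabilities 0 to 6.5535, and assumes round-to-nearest when encoding. The names `SCALE`, `to_probability` and `from_probability` are hypothetical.

```python
SCALE = 10000  # assumed: 65535 / 10000 gives the stated maximum of 6.5535


def to_probability(stored):
    """Convert a stored 2-byte integer into a probability."""
    return stored / SCALE


def from_probability(probability):
    """Convert a floating-point probability into the stored integer."""
    stored = int(round(probability * SCALE))
    if not 0 <= stored <= 65535:
        raise ValueError("value does not fit in a 2-byte unsigned integer")
    return stored


quarter = from_probability(0.25)
largest = to_probability(65535)
```

The range check matters because a negative or very large input would otherwise wrap or fail only later, when the value is packed into two bytes.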
All numbers are stored in little-endian (least significant byte first) order.
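As an illustration of the byte order, the snippet below writes one 2-byte value with Python's `struct` module, where the `<` prefix selects little-endian. The value 2500 is an arbitrary example.

```python
import struct

# 2500 is 0x09C4, so the least significant byte 0xC4 is written first.
encoded = struct.pack("<H", 2500)
decoded = int.from_bytes(encoded, "little")
```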