Gavin Band's webspace
A Binary GEN file format - BGEN

This page documents the SNP block layout for version 1.0 of the BGEN format. For the latest version of the BGEN format, see here.

Note: The data layout documented here has the limitation that alleles can be only 1 character long. This prevents its use with (for example) the latest 1000 Genomes data release, which contains structural variants many kilobases long. Version 1.1 was developed to be backwards-compatible with this format and addresses this issue.

The SNP blocks

Following the header comes a sequence of 0 or more SNP blocks. Each SNP block consists of the following data in order.
SNP block probability data

The probability data is listed as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of AA, the second the probability of AB, and the third the probability of BB. Altogether these occupy 6*N bytes, where N is the number of samples. By default these 6*N bytes are written directly. Alternatively, if the CompressedSNPBlocks flag is set in the header, the 6*N bytes are first compressed using zlib; the SNP block then contains a 4-byte integer representing the length of the compressed data, followed by the compressed data itself.

To convert the stored 2-byte integers into probabilities, the following calculation should be performed:

probability = stored integer / 10000
Note that the range of a two-byte unsigned integer is 0 - 65535 inclusive. Thus the resulting probabilities can take on values between 0 and 6.5535 inclusive and are accurate to four decimal places.

Note: to convert a floating point number to the format, multiply it by 10000 and round the result to the nearest integer.
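As an illustration, the conversion in both directions can be sketched in Python. This is a minimal sketch rather than a reference implementation: the scale factor of 10000 is inferred from the 0 to 6.5535 range stated above, the function names are hypothetical, and rounding half up is an assumption about how ties should be handled.

```python
import math

# Scale factor inferred from the stated range (65535 maps to 6.5535).
SCALE = 10000
MAX_STORED = 0xFFFF  # largest two-byte unsigned integer


def to_probability(stored: int) -> float:
    """Convert a stored two-byte unsigned integer into a probability."""
    if not 0 <= stored <= MAX_STORED:
        raise ValueError("stored value does not fit in two unsigned bytes")
    return stored / SCALE


def to_stored(probability: float) -> int:
    """Convert a probability into its stored integer form.

    Rounds to the nearest integer, with ties rounded up (an assumption).
    """
    stored = math.floor(probability * SCALE + 0.5)
    if not 0 <= stored <= MAX_STORED:
        raise ValueError("probability is outside the representable range")
    return stored
```

Values that would fall outside the two-byte range are rejected rather than clamped, so that out-of-range input is not silently altered.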
All numbers are stored in little-endian (least significant byte first) order.
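The probability data layout described above could be decoded along the following lines. This is a sketch under stated assumptions: the function name and arguments are hypothetical, the buffer is assumed to begin exactly at the probability data of one SNP block, the 4-byte length field is assumed to be unsigned, and the divisor of 10000 is inferred from the stated 0 to 6.5535 range.

```python
import struct
import zlib


def read_probabilities(data: bytes, n_samples: int, compressed: bool):
    """Decode one SNP block's probability data into (AA, AB, BB) triples."""
    expected = 6 * n_samples
    if compressed:
        # 4-byte little-endian length (assumed unsigned), then the zlib stream.
        (length,) = struct.unpack_from("<I", data, 0)
        raw = zlib.decompress(data[4:4 + length])
    else:
        raw = data[:expected]
    if len(raw) != expected:
        raise ValueError("expected 6 * N bytes of probability data")
    # "<" selects little-endian order; "H" is a two-byte unsigned integer.
    values = struct.unpack("<%dH" % (3 * n_samples), raw)
    # Group into triples and scale (divisor inferred from the stated range).
    return [
        tuple(v / 10000 for v in values[i:i + 3])
        for i in range(0, len(values), 3)
    ]
```

The `compressed` argument stands in for the CompressedSNPBlocks flag, which a real reader would take from the file header.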