Gavin Band's webspace
A Binary GEN file format - BGEN
This page documents version 1.1 of the BGEN format.
Change history
Background
A GEN file typically contains millions of floating-point numbers -- often in a fixed format -- stored in a textual representation. For example, for a cohort of 1500 individuals typed at 30000 SNPs, storing the AA, AB and BB genotype probabilities takes 1500 x 30000 x 3 = 135 million floating-point numbers. Consequently, programs that manipulate this data must spend a long time parsing the numbers to produce in-memory float or double quantities. For simple programs this parsing can dominate the execution time.
This page describes a binary GEN file format (the "BGEN" format) which overcomes this problem. Tests show that using this binary format can achieve a file input speed increase of 5-10x. Genotype data is stored compressed, so BGEN files typically take up no more space (and usually less) than a corresponding gzipped GEN file. The current specification has been updated to handle long alleles such as those present in the latest 1000 Genomes release.
A C++ implementation of this file format is available as the "genfile" sublibrary of QCTOOL, available here.
Overview
A BGEN file consists of a header block followed by a series of blocks called SNP blocks. The first four bytes of the file indicate the start position of the first SNP block (relative to the fifth byte of the file).
Note: All numbers in the file are stored as integers in little-endian (least significant byte first) order. This choice coincides with the memory layout used on most common architectures. See the Wikipedia page for more details.
Detailed specification
The first four bytes
The first four bytes of the file encode an unsigned integer indicating the offset, relative to the 5th byte of the file, of the start of the first SNP block (or the end of the file if there are 0 SNP blocks). For example, if this offset is 0, the SNP blocks start at byte 5.
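The offset described above can be read with a few lines of Python. This is a minimal sketch rather than part of the specification: the function name `first_snp_block_position` is hypothetical, and it returns a 0-based file position, so an offset of 0 maps to position 4 (that is, the fifth byte of the file).

```python
import io
import struct


def first_snp_block_position(stream):
    """Return the 0-based file position of the first SNP block.

    `stream` is any binary file-like object positioned at the start of
    the file. (Hypothetical helper name, for illustration only.)
    """
    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("file too short to contain the 4-byte offset")
    # "<I" = little-endian unsigned 32-bit integer.
    (offset,) = struct.unpack("<I", raw)
    # The offset is relative to the fifth byte, i.e. 0-based position 4.
    return 4 + offset


# In-memory stand-in for a file whose stored offset is 20.
demo = io.BytesIO(struct.pack("<I", 20) + bytes(20))
print(first_snp_block_position(demo))  # prints 24
```

With a real file, the same function could be applied to an object opened in binary mode (`open(path, "rb")`), followed by `seek()` to the returned position.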
The header block
The header block contains global information about the file.
Header block -- flag definitions
The following flags can be contained in the flags field in the header block. Note: all bits not listed here must be set to 0.
SNP blocks
Following the header comes a sequence of 0 or more SNP blocks. Each SNP block consists of the following data in order. (Note: the following description is valid when LongIds=1. When LongIds=0, SNP blocks are laid out as per the v1.0 spec, described here.)
SNP block probability data
The probability data is listed as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of AA, the second the probability of AB, and the third the probability of BB. Altogether these occupy 6*N bytes, where N is the number of samples.
When CompressedSNPBlocks is not set, these 6*N bytes are stored directly. When CompressedSNPBlocks is set, these 6*N bytes are first compressed using zlib, and the length of the compressed data is stored as a 4-byte integer, followed by the compressed data itself.
To convert the stored 2-byte integers into probabilities, the following calculation should be performed:
Note that the range of a two-byte unsigned integer is 0 - 65,535 inclusive. Thus the resulting probabilities can take on values between 0 and 65,535/32768 ~ 1.9999 inclusive, and they are accurate to four decimal places.
Note: to convert a floating-point probability to its integer representation, do the following:
All numbers are stored in little-endian (least significant byte first) order.
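As an illustration of the probability encoding described above, the following Python sketch reads one SNP block's probability data and converts values in both directions. It rests on several assumptions that are not spelled out in the text here: the scale factor is taken to be 32768 (as implied by the stated maximum of 65,535/32768), the float-to-integer direction is assumed to multiply by 32768 and round to the nearest integer, and the 4-byte compressed length is assumed to be unsigned. All function names are hypothetical.

```python
import io
import struct
import zlib

SCALE = 32768.0  # assumption: inferred from the stated maximum 65,535/32768


def read_probability_block(stream, n_samples, compressed):
    """Read one block of probability data and return (AA, AB, BB) triples.

    `stream` must be positioned at the start of the probability data.
    """
    if compressed:
        # 4-byte little-endian length (assumed unsigned), then zlib data.
        (length,) = struct.unpack("<I", stream.read(4))
        raw = zlib.decompress(stream.read(length))
    else:
        raw = stream.read(6 * n_samples)
    if len(raw) != 6 * n_samples:
        raise ValueError("expected %d bytes of probability data" % (6 * n_samples))
    # 3 * N little-endian unsigned 16-bit integers.
    values = struct.unpack("<%dH" % (3 * n_samples), raw)
    return [
        tuple(v / SCALE for v in values[i:i + 3])
        for i in range(0, len(values), 3)
    ]


def encode_probability(p):
    """Assumed inverse conversion: scale, round, clamp to the uint16 range."""
    return max(0, min(65535, int(round(p * SCALE))))


# Round trip for two samples using a compressed block held in memory.
raw = struct.pack("<6H", 32768, 0, 0, 0, 16384, 16384)
packed = zlib.compress(raw)
block = io.BytesIO(struct.pack("<I", len(packed)) + packed)
print(read_probability_block(block, 2, compressed=True))
# prints [(1.0, 0.0, 0.0), (0.0, 0.5, 0.5)]
```

The clamp in `encode_probability` is a defensive choice for this sketch; valid probabilities between 0 and 1 never reach it.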