Gavin Band's webspace

A Binary GEN file format - BGEN

This page documents the SNP block layout for version 1.0 of the BGEN format. For the latest version of the BGEN format, see here.

Note: The data layout documented here has the limitation that alleles can be only 1 character long. This prevents its use with (for example) the latest 1000 genomes data release, which contains structural variants many kilobases long. Version 1.1 was developed to be backwards-compatible with this format, and addresses this issue.

The SNP blocks

Following the header comes a sequence of zero or more SNP blocks. Each SNP block consists of the following data, in order.

No. of bytes | Description
------------ | -----------
4 | The number of individuals the row represents, hereafter denoted N. This is an integer encoded in four bytes.
1 | An unsigned integer S, indicating the length of the storage used for the SNPID and RSID fields in the row.
1 | The length, SNPID_size, of (the data part of) the SNPID string. This must be between 0 and S.
S | The SNPID of the row. Only the first SNPID_size bytes will be used.
1 | The length, RSID_size, of (the data part of) the RSID string. This must be between 0 and S.
S | The RSID of the row. Only the first RSID_size bytes will be used.
1 | The chromosome on which the SNP is found, encoded as an unsigned 8-bit integer. The encoding is:
  | 1-22: SNP lies in the chromosome with the given number.
  | 23: SNP lies in the non pseudo-autosomal part of the X chromosome.
  | 24: SNP lies in the non pseudo-autosomal part of the Y chromosome.
  | 253: SNP lies in the pseudo-autosomal region of the X/Y chromosomes.
  | 254: SNP lies in the mitochondrial DNA.
  | 255: Indicates the chromosome is unknown. We advise that this only be used for test data.
4 | The SNP position, encoded as an unsigned 32-bit integer.
1 | The A allele.
1 | The B allele.
P | Genotype probability data for the SNP for each of the N individuals in the cohort. If the CompressedSNPBlocks flag is not set, this field consists of P = 6*N bytes representing the probabilities. If CompressedSNPBlocks is set, this field contains a 32-bit unsigned integer specifying the length of the compressed data, followed by the compressed data itself, so that P is 4 plus the compressed length. See below for more details.
14 + 2*S + P | TOTAL
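As an illustration of the layout above, the following Python sketch reads the fields that precede the probability data. It is not part of any BGEN tool: the names `SnpBlockHeader`, `parse_snp_block_header` and `chromosome_label` are hypothetical, the labels "X", "Y", "XY" and "MT" are chosen here for readability, and N is assumed to be unsigned.

```python
import struct
from typing import NamedTuple


class SnpBlockHeader(NamedTuple):
    n_samples: int    # N
    snpid: bytes      # first SNPID_size bytes of the SNPID storage
    rsid: bytes       # first RSID_size bytes of the RSID storage
    chromosome: int   # raw 1-byte chromosome code
    position: int
    allele_a: bytes
    allele_b: bytes
    end_offset: int   # offset of the first byte of the probability data


def chromosome_label(code: int) -> str:
    """Translate the 1-byte chromosome code into a readable label."""
    if 1 <= code <= 22:
        return str(code)
    # Labels are an assumption of this sketch; the format only defines the codes.
    special = {23: "X", 24: "Y", 253: "XY", 254: "MT", 255: "unknown"}
    if code in special:
        return special[code]
    raise ValueError(f"chromosome code {code} is not defined by the format")


def parse_snp_block_header(buf: bytes, offset: int = 0) -> SnpBlockHeader:
    """Read every field of a SNP block up to, but excluding, the probabilities."""
    # "<" selects little-endian with no padding: 4-byte N, then 1-byte S.
    n_samples, storage = struct.unpack_from("<IB", buf, offset)
    offset += 5
    ids = []
    for _ in range(2):  # SNPID first, then RSID
        (size,) = struct.unpack_from("<B", buf, offset)
        if size > storage:
            raise ValueError("identifier length exceeds the storage size S")
        ids.append(buf[offset + 1 : offset + 1 + size])
        offset += 1 + storage  # always skip the full S bytes of storage
    chromosome, position = struct.unpack_from("<BI", buf, offset)
    offset += 5
    allele_a = buf[offset : offset + 1]
    allele_b = buf[offset + 1 : offset + 2]
    return SnpBlockHeader(
        n_samples, ids[0], ids[1], chromosome, position,
        allele_a, allele_b, offset + 2,
    )
```

The returned `end_offset` marks where the P bytes of genotype probability data begin, so a reader can continue from there.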

SNP block probability data

The probability data is listed as a sequence of 2-byte unsigned integers. These should be interpreted in triples, the first member being the probability of AA, the second the probability of AB, the third the probability of BB. Altogether these occupy 6*N bytes, where N is the number of samples. If the CompressedSNPBlocks flag is not set in the header, these 6*N bytes are written directly. If the flag is set, the 6*N bytes are first compressed using zlib; the SNP block then contains a 4-byte integer representing the length of the compressed data, followed by the compressed data itself.
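A minimal sketch of reading this field in Python, using the standard `struct` and `zlib` modules. The function names `read_probability_bytes` and `unpack_triples` are hypothetical, and the `compressed` argument stands in for the CompressedSNPBlocks flag, which the caller is assumed to have read from the header. The values are unpacked as unsigned, in line with the 0 to 65535 range given for the stored integers.

```python
import struct
import zlib


def read_probability_bytes(buf: bytes, offset: int, n_samples: int, compressed: bool):
    """Return (the 6*N uncompressed probability bytes, offset just past the field)."""
    expected = 6 * n_samples
    if compressed:
        # 4-byte little-endian length, then that many bytes of zlib data.
        (length,) = struct.unpack_from("<I", buf, offset)
        start = offset + 4
        end = start + length
        raw = zlib.decompress(buf[start:end])
    else:
        end = offset + expected
        raw = buf[offset:end]
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes of probability data, got {len(raw)}")
    return raw, end


def unpack_triples(raw: bytes):
    """Group the stored 2-byte integers into (AA, AB, BB) triples, one per sample."""
    values = struct.unpack(f"<{len(raw) // 2}H", raw)
    return [values[i : i + 3] for i in range(0, len(values), 3)]
```

Checking the decompressed length against 6*N is a design choice of this sketch; it catches truncated or mismatched blocks early.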

To convert the stored 2-byte integers into probabilities, the following calculation should be performed:

  1. Convert the number into a floating-point format (e.g. float or double).
  2. Divide by 10,000.

Note that the range of a two-byte unsigned integer is 0 - 65535 inclusive. Thus the resulting probabilities can take on values between 0 and 6.5535 inclusive and are accurate to four decimal places.
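The conversion above could be written in Python roughly as follows. The function name `stored_to_probability` is hypothetical, and the range check is an addition of this sketch rather than a requirement of the format.

```python
def stored_to_probability(stored: int) -> float:
    """Convert a stored 2-byte integer into a probability by dividing by 10,000."""
    if not 0 <= stored <= 65535:
        raise ValueError("stored value does not fit in two unsigned bytes")
    return stored / 10000.0
```

For instance, a stored value of 10000 corresponds to a probability of 1.0, and the largest stored value, 65535, corresponds to 6.5535.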

Note: to convert a floating-point number to the stored 2-byte format, do the following:

  1. Check the number lies in the half-open interval [ 0, 6.55355 ).
  2. Multiply by 10000 and round to the nearest integer.
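The two steps above might look like this in Python. The name `probability_to_stored` is hypothetical. Rounding is done by adding 0.5 and taking the floor, which is one reasonable reading of "round to the nearest integer", and the final clamp to 65535 is a precaution added here against floating-point error right at the upper bound.

```python
import math


def probability_to_stored(p: float) -> int:
    """Convert a probability into the 2-byte integer stored in the file."""
    # Step 1: the value must lie in the half-open interval [0, 6.55355).
    if not 0.0 <= p < 6.55355:
        raise ValueError(f"{p} is outside the representable range [0, 6.55355)")
    # Step 2: scale by 10000 and round to the nearest integer.
    stored = int(math.floor(p * 10000 + 0.5))
    # Guard (an assumption of this sketch): never exceed the 2-byte maximum.
    return min(stored, 65535)
```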

All numbers are stored in little-endian (least significant byte first) order.
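To make the byte order concrete, here is a short Python illustration using the standard `struct` module, where the `<` prefix selects little-endian encoding. The example values are arbitrary.

```python
import struct

# A SNP position of 123456 is 0x0001E240; the least significant byte comes first.
position_bytes = struct.pack("<I", 123456)

# A probability of 1.0 is stored as 10000, which is 0x2710.
probability_bytes = struct.pack("<H", 10000)

# Reading the bytes back with the same format recovers the original values.
(position,) = struct.unpack("<I", position_bytes)
(stored,) = struct.unpack("<H", probability_bytes)
```

Here `position_bytes` holds the bytes 0x40, 0xE2, 0x01, 0x00 and `probability_bytes` holds 0x10, 0x27.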