Getting started
LDBIRD computes LD metrics (i.e. the genotype correlation ) between all pairs of variants in two sets of genotypes specified on the command line, or between all pairs in one set of genotypes. It records a histogram of the LD values it sees and outputs specific pairs meeting specific thresholds. The two ways to run it are:
Run with one genotypes file:
ldbird -g1 <file.bgen> -s samples.sample -o ldbird.sqlite
This computes genotype counts and correlation () between every pair of variants in file.bgen
. By default only
those with will be output, but if you specify -min-r2 0
then the number of output rows will be .
Run with two genotypes files:
ldbird -g1 <file1.bgen> -g2 <file2.bgen> -s samples.sample -o ldbird.sqlite
This computes genotype counts and correlation between every variant in file1.bgen
and in file2.bgen
. By default
only those with will be output, but if you specify -min-r2 0
then the number of output rows will be , where is the number of variants in file<i>.bgen
.
Supported file formats
You don't necessarily need to use BGEN files - LDBIRD also supports VCF format and a subset of other formats supported by QCTOOL.
Samples must be supplied in a sample file of the type understood by QCTOOL - see the QCTOOL file formats page for details.
What you can put in
At the time of writing there are three types of genotype you can put in. First, the input files can contain haploid
genotypes. If so then the R
table will contain a table of counts at both loci as well as the computed R value.
Second, diploid genotypes can be put in - in this case the table of counts is currently not output.
A third option is to input haploid genotypes that are encoded as if they were diploid. The -assume-haploid
option
tells LDBIRD to convert these internally to haploid genotypes - it does this by treating heterozygous calls as missing
and converting each homozygous call to the corresponding haploid genotype.
Understanding the output
LDBIRD always outputs results to a sqlite database. The output file has the following features:
-
The analysis name, its start and stop times, and the command-line arguments are recorded in the
Analysis
,AnalysisStatus
, andAnalysisProperty
tables. -
LDBIRD computes the frequency of each variant it sees. The variants are stored in the
Variant
table and the frequencies are stored in theFrequency
table. A convenient view,FrequencyView
is also created to link these tables together. -
The main results are stored in the
R
table. A more convenient view,RView
is also created which links theR
,Analysis
, andVariant
tables. -
A histogram of values is also stored in the
Histogram
table - again there is a more convenientHistogramView
view which joins this to theAnalysis
table.
Using the R
table
To save space in the output file, LDBIRD currently encodes correlations by the following formula:
This is stored in the encoded_r
column of the R
table. Because sqlite uses a variable-length integer encoding,
this value only takes up one or two bytes of space in the output file.
To convert encoded_r
back to correlation, you can use the formula:
E.g. in sqlite:
$ sqlite3 ldbird.sqlite
> SELECT *, (encoded_r-1024.0)/1024.0 AS r FROM R
(NB. the form 1024.0
is needed here to force sqlite to treat the result as a floating-point number.)
The above means that LDBIRD stores a quantized version of the correlation - it is always encoded as an integer between 0 and 2048 (inclusive). Correlation is expressed with respect to the second allele of each variant - this values in the range 0-1023 represent -ve correlation between the second allele, and values in the range 1025-2048 represent positive correlation.
Controlling what comparisons are made
There are three additional options which control what variants LDBIRD computes LD for:
-
The
-min-distance
option tells LDBIRD not to compute LD for any variants that are too close together on the same chromosome. -
The
-min-maf
option tells LDBIRD not to compute LD for any variants that have too low minor allele frequency -
The
-min-N
option tells LDBIRD to not compute LD for any pair of variants with fewer than this number of samples having non-missing genotypes (taken pairwise across the two variants). -
The
-min-N-propn
option tells LDBIRD to not compute LD for any pair of variants with fewer than this proportion of samples having of non-missing genotypes (taken pairwise across the two variants). -
Finally, the
-min-r2
option tells LDBIRD not to output results in theR
table for any variants where the computed is less than the given threshold. (This is slightly different to the above options because LD is still computed and contributes to theHistogram
table).
Additionally a number of options are provided to filter the set of samples included in the analysis, or to filter the genomic ranges or IDs of variants that are included. See the page on filtering samples for more information.
Using the Histogram
table
At the end of the run, LDBIRD stores a histogram of LD in the Histogram
table. This is set up as follows: for each
value of encoded_r
, the corresponding row of Histogram
stores the number of variant pairs that were observed to
have that encoded_r
value. The column encoded_r
can be converted back to as shown above. The Histogram
table always has 2049 rows (or 2049 per analysis, if you run multiple analyses into the same output file) corresponding
to the 2049 possible values of encoded_r
.
The Histogram
table ignores any minimum value specified using -min-r2
- in cases of testing long-range LD
it will therefore likely contain a peak of LD values near (i.e. around ), while if you
test nearby variants you will get a spike near ($\text{encoded r}=2048). However, it respects the other filtering
options outlined above, i.e. only variants with sufficient frequency and levels of non-missingness will be represented.
Setting up convenience views
It is generally most useful to wrap all the above into a combined view of the frequencies and LD. We recommend setting this up by running the following SQL in the output file:
CREATE VIEW MyRView AS
SELECT A.name AS `analysis`,
V1.id AS g1_id, V1.rsid AS `g1_rsid`, V1.chromosome AS `g1_chromosome`, V1.position AS `g1_position`,
V2.id AS g2_id, V2.rsid AS `g2_rsid`, V2.chromosome AS `g2_chromosome`, V2.position AS `g2_position`,
R.*, (R.encoded_r-1024.0)/1024 AS r,
F1.frequency AS g1_frequency, F2.frequency AS g2_frequency
FROM R
INNER JOIN Variant V1 ON V1.id = g1_id
INNER JOIN Variant V2 ON V2.id = g2_id
CROSS JOIN Analysis A ON A.id = R.analysis_id
CROSS JOIN Frequency F1 ON F1.analysis_id == R.analysis_id AND F1.variant_id == R.g1_id
CROSS JOIN Frequency F2 ON F2.analysis_id == R.analysis_id AND F2.variant_id == R.g2_id;
Which can be used to produce output like this (here run on test data):
$ sqlite3 -column -header ldbird.sqlite "SELECT * FROM MyRView LIMIT 10"
analysis g1_id g1_rsid g1_chromosome g1_position g2_id g2_rsid g2_chromosome g2_position analysis_id g1_id:1 g2_id:1 N encoded_r r g1_frequency g2_frequency
--------------- ---------- ---------- ------------- ----------- ---------- ---------- ------------- ----------- ----------- ---------- ---------- ---------- ---------- ---------- ------------ ------------
ldbird analysis 1 H1 H1 1 1 H1 H1 1 1 1 1 4000 2048 1.0 0.181 0.181
ldbird analysis 1 H1 H1 1 2 H2 H1 2 1 1 2 4000 1041 0.01660156 0.181 0.14225
ldbird analysis 1 H1 H1 1 3 H3 H1 3 1 1 3 4000 1017 -0.0068359 0.181 0.19
ldbird analysis 1 H1 H1 1 4 H4 H1 4 1 1 4 4000 1051 0.02636718 0.181 0.01025
ldbird analysis 1 H1 H1 1 5 H5 H1 5 1 1 5 4000 1022 -0.0019531 0.181 0.31725
ldbird analysis 1 H1 H1 1 6 H6 H1 6 1 1 6 4000 982 -0.0410156 0.181 0.00525
ldbird analysis 1 H1 H1 1 7 H7 H1 7 1 1 7 4000 992 -0.03125 0.181 0.14725
ldbird analysis 1 H1 H1 1 8 H8 H1 8 1 1 8 4000 1022 -0.0019531 0.181 0.1465
ldbird analysis 1 H1 H1 1 9 H9 H1 9 1 1 9 4000 1026 0.00195312 0.181 0.039
ldbird analysis 1 H1 H1 1 10 H10 H1 10 1 1 10 4000 1002 -0.0214843 0.181 0.405
LDBIRD doesn't create this view by itself currently, but a future version might do so. A similar view of the Histogram table can also be made.