Sequence Alignment with monotonic gap penalties


Wellcome Trust Centre for Human Genetics

STATISTICS

ariadne and prospero test the statistical significance of similarity scores using the following model (more details are available as a separate download):

A score S in a comparison between two sequences or a sequence and a profile has pairwise p-value P given by

P = 1 - exp(-K*m*n*exp(-L*S))

where m and n are the sequence lengths, and K and L are parameters that depend on the sequence compositions, the scoring scheme and (slightly) on the sequence lengths. K and L are calculated using the formulae described in Mott (2000), which take account of sequence composition, substitution matrix, gap penalty and sequence length:

L = L_u*(1.013 -2.61*alpha + f(m,n)( -0.76 + 9.34*alpha +1.12/H) )

K = K_u*exp( 0.26 -18.92*alpha + f(m,n)(-1.76 + 32.69*alpha + 192.52*alpha^2 + 3.24/H ) )

where:

f(m,n) = log(m*n)*(1/m+1/n),

L_u, K_u are the parameters for ungapped alignments,

H is the entropy of ungapped alignments (e.g. as defined in Karlin and Altschul, PNAS 1990, or Mott and Tribe, 1999),

and alpha is a parameter depending on the gap penalty A+B*k:

alpha = 2*s*exp(-L_u(A+B))/(1-exp(-L_u*B))

where

s = sqrt( (K_u/H) * delta / exp(L_u*delta) )

and delta is the smallest span of score values (usually 1).
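The formulae above can be chained into a single calculation. The following is a minimal Python sketch, not the code used by ariadne or prospero: the function name `gapped_params` is hypothetical, `log` is assumed to be the natural logarithm, and the input values in the usage lines are illustrative placeholders, not parameters taken from either program.

```python
import math

def gapped_params(L_u, K_u, H, A, B, m, n, delta=1.0):
    """Return (L, K, alpha) for gap penalty A + B*k, following the formulae in the text.

    L_u, K_u : parameters for ungapped alignments
    H        : entropy of ungapped alignments
    m, n     : sequence lengths
    delta    : smallest span of score values (usually 1)
    """
    # s = sqrt( (K_u/H) * delta / exp(L_u*delta) )
    s = math.sqrt((K_u / H) * (delta / math.exp(L_u * delta)))
    # alpha = 2*s*exp(-L_u*(A+B)) / (1 - exp(-L_u*B))
    alpha = 2.0 * s * math.exp(-L_u * (A + B)) / (1.0 - math.exp(-L_u * B))
    # f(m,n) = log(m*n)*(1/m + 1/n); natural log is an assumption here
    f = math.log(m * n) * (1.0 / m + 1.0 / n)
    L = L_u * (1.013 - 2.61 * alpha + f * (-0.76 + 9.34 * alpha + 1.12 / H))
    K = K_u * math.exp(
        0.26 - 18.92 * alpha
        + f * (-1.76 + 32.69 * alpha + 192.52 * alpha ** 2 + 3.24 / H)
    )
    return L, K, alpha

# Placeholder inputs, chosen only to exercise the function.
L, K, alpha = gapped_params(L_u=0.32, K_u=0.13, H=0.40, A=12, B=1, m=51, n=43)
print(L, K, alpha)
```

As the gap penalty grows, alpha tends to zero and only the length-dependent terms in f(m,n) remain, which is a convenient sanity check on any implementation.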

In the example above, K = 1.361261e-01, L = 3.478619e-01, m = 51, n = 43 and S = 125, so

P = 1-exp(-0.1361*51*43*exp(-0.3478*125)) = 5.95e-16. The database size was set at 1000000 = 1.0e6 sequences, so the E-value (P multiplied by the database size) is 5.95e-10.
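The p-value and E-value calculation can be sketched in Python as follows. This is an illustration of the formula, not the programs' own code: the function names `pairwise_p_value` and `e_value` are hypothetical, and the E-value is taken to be the p-value multiplied by the database size, as in the text. Evaluated on the rounded figures printed above, the formula itself comes out near 4e-17, so the sketch should not be read as a reproduction of the reported value.

```python
import math

def pairwise_p_value(K, L, m, n, S):
    """P = 1 - exp(-K*m*n*exp(-L*S)), computed without cancellation."""
    x = K * m * n * math.exp(-L * S)
    # For tiny x, 1 - exp(-x) rounds to 0.0 in double precision;
    # -expm1(-x) is the same quantity and keeps full accuracy.
    return -math.expm1(-x)

def e_value(p, db_size):
    """E-value as p-value times the number of database sequences."""
    return p * db_size

# Values quoted in the text.
K, L = 1.361261e-01, 3.478619e-01
m, n, S = 51, 43, 125

p = pairwise_p_value(K, L, m, n, S)
print(p, e_value(p, 1.0e6))
```

The use of `math.expm1` matters for highly significant scores: the naive expression `1 - math.exp(-x)` returns exactly zero once x drops below about 1e-16.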


Please send questions, comments and bug reports to Richard Mott.

 