The Wellcome Trust Centre
for Human Genetics
Copyright Richard Mott, 2000,
University of Oxford,
Wellcome Trust Centre for Human Genetics,
Roosevelt Drive OX3 7BN UK
Richard.Mott@well.ox.ac.uk
ARIADNE is a package of two programs, ariadne and prospero, that compare protein sequences and profiles using the Smith-Waterman algorithm, and assesses statistical significance using a new accurate formula, described in Mott, 2000, "Accurate Formula for P-values of gapped local sequence and profile alignments" J. Mol Biol. 300:649-659 [link to JMB] .
The package is written in ANSI C.
The sequence/profile comparison algorithms used in this code are fairly standard, and are probably not the fastest implementations available. The novel part is the method for determining statistical significance, which will give thresholds of significance that are accurate to within 5% 95% of the time. See Mott (2000) and Mott and Tribe (1999) for more details.
You are free to incorporate the method used for assessing statistical significance into third-party code, provided you cite the above reference. The routines for assessing significance are all in gaplib.c
A "library" contains one or more sequences or profiles (see formats).
ariadne differs from most other database-search programs in that statistical significance is determined on the fly, and statistically significant alignments are printed out in the order the comparisons were made, ie they are not sorted. A Perl script ariadne_sort.pl is provided to sort the output in best-first order.
Sequence libraries must be in FASTA format. e.g
>NP_005359 mglsdgewqlvlnvwgkveadipghgqevlirlfkghpetlekfdkfkhlksedemkase dlkkhgatvltalggilkkkghheaeikplaqshatkhkipvkylefiseciiqvlqskh pgdfgadaqgamnkalelfrkdmasnykelgfqgFormat for profiles:
A library of profiles can contain multiple profiles. Format for each profile is a 3-line header followed by the profile data. Blank lines are ignored. The header looks like
>name name of the profile, eg "globin" profile-length alphabet-length eg "141 21" alphabet-order List of space-separated amino-acids in the order they occur in the profileThe profile contains a series of lines, one per profile position, with format
[position] [consensus] [profile-scores in order defined by alphabet-order]
For example, part of a profile looks like this:
>globin
141 21
A R N D C Q E G H I L K M F P S T W Y V X
1 H -4 -1 2 -2 -5 -1 -1 -3 10 -5 -5 -2 -4 -3 -4 -2 -3 -4 0 -5 -1
2 L -4 -5 -6 -7 0 -5 -6 -6 -5 -1 6 -5 0 0 -6 -5 -4 6 -3 -2 -1
3 S -1 -4 -1 -1 -4 -2 -2 -2 -4 -4 -4 -3 -2 -5 -4 5 6 -5 -5 -3 -1
4 A 4 -3 0 3 -1 0 2 -1 -2 -3 -3 -1 -2 -3 0 0 -2 -6 -3 -3 -1
5 E 2 -1 -1 3 -3 0 3 -1 0 -5 -3 1 -4 -3 -1 -1 -1 -6 -4 -3 -1
6 E -3 -3 0 4 -6 4 5 -4 -3 -3 -5 -2 -4 -6 -4 -2 -1 -6 -5 -3 -1
7 K 0 3 -2 -4 -2 -1 1 -4 -1 0 -3 5 -2 -1 -4 -2 -3 4 -4 -1 -1
8 A 3 -1 0 0 -2 2 0 -2 1 -2 -2 1 -3 -2 -4 1 1 -5 -4 -3 -1
9 L 2 -2 2 -2 -2 -2 -1 -4 0 1 3 0 -1 -2 -5 -2 0 -5 -2 0 -1
10 V -2 -6 -6 -6 -4 -5 -5 -6 -6 5 1 -5 -1 -2 -5 -4 -3 -6 -4 6 -1
11 K -1 3 1 -2 -3 2 -2 -3 0 -3 0 5 0 -5 -4 -1 1 -5 -4 -2 -1
12 A 2 -1 0 1 -3 0 0 0 1 -4 -3 1 -2 -4 -4 3 0 -6 -4 -3 -1
13 L 0 -4 -2 -3 0 0 -2 -4 -1 1 1 -3 -1 -1 -4 3 2 -5 -3 1 -1
14 W -3 -6 -6 -5 -2 -5 -4 -4 -5 -2 -4 -6 -2 2 -7 -4 -4 13 0 -2 -1
........... etc etc
So eg at position 2 the consensus is L and the score for matching an A at this position is -4
Profiles and substitution matrices should be such that the smallest span of score values is 1 Also, for reasons of efficiency it is best is the total range of score values is quite small, say < 30
Some example profile sets derived from PFAM-A are provided for download.
Alternatively, you can download precompiled binaries for Linux, SGI IRIX64, SunOS, OSF.
To compile the sources, after untarring the distribution ariadne-1.2.tar.Z type the following commands
% cd ./SRC-1.2/ % makeThe binaries will be in the directory ./SRC-1.2/$UNAME where $UNAME is the machine architecture returned by the uname command. Either put this directory on your path or copy the binaries to a directory in your path.
NB You will need a version of make compatible with gnu make. If you don't have this, then you can compile the programs with the following commands executed in the source directory:
cc -o ./ariadne -O -I./Include cl.c gaplib.c profile_util.c prospero_utils.c searchpath.c cmp.c profile_stat.c ariadne.c readline.c seq_util.c -lm cc -o ./prospero -O -I./Include cl.c gaplib.c profile_util.c prospero_utils.c searchpath.c cmp.c profile_stat.c prospero.c readline.c seq_util.c -lmThis will leave the executables in the source directory; they will need to be moved elsewhere by hand.
usage: ariadne
-mode text [ ]
-seq Readable File [ ]
-seq2 Readable File [ ]
-profile Readable File [ ]
-matrix text [ ]
-A integer [ 11 ]
-B integer [ 1 ]
-ethresh float [ 0.1 ]
-dbsize float [ 1 ]
-help switch [ ]
It is always faster to compare profile(s) to sequence(s), because of the overhead of reading the profiles, so use mode pp2ss in preference to ss2pp.
E = P*dbsize
and if E < ethresh the alignment is printed out
WARNING: If you set ethresh to a high value (or set dbsize to a low value) then the search will take considerably longer. This is because statistical significance is assessed initially assuming a standard sequence composition, so that pre-computed parameters are used. If the estimated e-value is < 10*ethresh then statistical significance is recomputed using parameters that reflect the sequence/profile compositions accurately. This second computation is quite slow, but is usually only triggered in a small proportion of cases, provided the dbsize and ethresh are set appropriately.
If you want the pairwise p-values printed, set the dbsize to 1 and set the ethresh to a small value such as 1.0e-8. You will then get all similarities with p-values < 1.0e-8. This will still be fast.
> 0 d1bal__ len 51 from 14 to 49 vs d2pdd__ len 43 from 5 to 40 score 77 eval 6.954146e-10 identity 36.11% K 1.361261e-01 L 3.478619e-01 H 1.187679e+00 alpha 8.223475e-02
14 PAIRRLLAEHNLDASAIKGTGVGGRLTREDVEKHLA 49 d1bal__
|::|: | :| ::||| ||: :||:: ||
5 PSVRKYAREKGVDIRLVQGTGKNGRVLKEDIDAFLA 40 d2pdd__
The one-line summary says that the comparison between d1bal__ (51
residues) and d2pdd__ (43 residues) scored 77 with an e-value of
6.954e-10 and a percentage identity 36%. The parameters used for
assessing significance, K, L, H, alpha took the values 0.136, 0.347,
1.187, 0.082. [H, alpha are ony printed for information].
Profile-Sequence alignments look the same as Sequence-Sequence alignments, with the profile consensus printed.
A score S in a comparison between two sequences or a sequence and a profile has pairwise p-value P given by
P = 1-exp(-Kmn exp(-L*S))
where m,n are the sequence lengths and K, L are parameters depending on the compositions, scoring scheme and (slightly) on the sequence lengths. K, L are calculated using the formula described in Mott, 2000, which takes account of sequence composition, substitution matrix, gap penalty, sequence length:
L = L_u*(1.013 -2.61*alpha + f(m,n)( -0.76 + 9.34*alpha +1.12/H) )
K = K_u*exp( 0.26 -18.92*alpha + f(m,n)(-1.76 + 32.69*alpha + 192.52*alpha^2 + 3.24/H ) )
where: f(m,n) = log(m*n)*(1/m+1/n) L_u, K_u are the parameters for ungapped alignments H is the entropy of ungapped alignments (eg as defined Karlin-Altschul PNAS 1990, or Mott and Tribe 1999), and alpha is a parameter depending on the gap penalty A+B*k:
alpha = 2*s*exp(-L_u(A+B))/(1-exp(-L_u*B))
where
s = sqrt{ (K_u/H) * [ delta / exp(L_u *delta ) ] }
and delta is the smallest span of score values (usually 1)
In the example above, K = 1.361261e-01 L = 3.478619e-01, m = 51, n = 43, S = 125, so
P = 1-exp(-0.1361*51*43*exp(-0.3478*125)) = 5.95e-16. The database size was set at 100000 = 1.0e6 sequences, so the evalue is 5.95e-10.
usage: prospero -seq1 Readable File [ ] -seq2 Readable File [ ] -pro Readable File [ ] -matrix text [ blosum62 ] -A integer [ 11 ] -B integer [ 1 ] -top integer [ 5 ] -pseudocount integer [ 100 ] -ethresh float [ 0 ] -help switch [ ]
Output from prospero is very similar to ariadne's. The E-values reported are in fact P-values, as the database size is set to 1. The alignments are sorted by evalue. Example:
pangloss [84]% ./SRC-1.2/Linux/prospero -seq1 gi130316.pep -ethresh 0.001
***** PROSPERO V1.0 Sun May 14 13:25:04 2000 *****
Copyright Richard Mott, 2000, Wellcome Trust Centre for Human Genetics, University of Oxford
using gap penalty 11+1k
using matrix blosum62
printing all alignments with eval < 0.001000
using sequence1 gi|130316|sp|P00747|PLMN_HUMAN
using self-comparison
> 1 gi|130316|sp|P00747|PLMN_HUMAN len 810 from 102 to 455 vs gi|130316|sp|P00747|PLMN_HUMAN len 810 from 184 to 561 score 994 eval 7.000317e-93 identity 49.58% K 1.978546e-02 L 2.230037e-01 H 1.439763e+00 alpha 1.068759e-01
102 ECKTGNGKNYRGTMSKTKNGITCQKWSSTSPHRPRFSPATHPSEGLEENYCRNPDNDPQG 161 gi|130316|sp|P00747|PLMN_HUMAN
|| :|:|| | :||| :|: || | | ||| : |: |:: |::||||||| : :
184 ECMHCSGENYDGKISKTMSGLECQAWDSQSPHAHGYIPSKFPNKNLKKNYCRNPDRELR- 242 gi|130316|sp|P00747|PLMN_HUMAN
162 PWCYTTDPEKRYDYCDILECEE---------ECMHCSGENYDGKISKTMSGLECQAWDSQ 212 gi|130316|sp|P00747|PLMN_HUMAN
|||:|||| ||:: ||| | :|: :|||| | :: |:|| || | :|
243 PWCFTTDPNKRWELCDIPRCTTPPPSSGPTYQCLKGTGENYRGNVAVTVSGHTCQHWSAQ 302 gi|130316|sp|P00747|PLMN_HUMAN
213 SPHAHGYIPSKFPNKNLKKNYCRNPDRELRPWCFTTDPNKRWELCDIPRC---------- 262 gi|130316|sp|P00747|PLMN_HUMAN
:|| | | || ||| :||||||| : ||| ||: ||| | || |
303 TPHTHNRTPENFPCKNLDENYCRNPDGKRAPWCHTTNSQVRWEYCKIPSCDSSPVSTEQL 362 gi|130316|sp|P00747|PLMN_HUMAN
263 -TTPPPSSGPTYQ-CLKGTGENYRGNVAVTVSGHTCQHWSAQTPHTHNRTPENFPCKNLD 320 gi|130316|sp|P00747|PLMN_HUMAN
| || | | | | |::||| : | :| || ||: ||| | :||||:| |
363 APTAPPELTPVVQDCYHGDGQSYRGTSSTTTTGKKCQSWSSMTPHRHQKTPENYPNAGLT 422 gi|130316|sp|P00747|PLMN_HUMAN
321 ENYCRNPDGKRAPWCHTTNSQVRWEYCKIPSCDSSPVSTEQLAPTA--PPELTPVVQDCY 378 gi|130316|sp|P00747|PLMN_HUMAN
||||||| : ||| ||: |||||| : | : | | | || :||
423 MNYCRNPDADKGPWCFTTDPSVRWEYCNLKKCSGTEASVVAPPPVVLLPDVETPSEEDCM 482 gi|130316|sp|P00747|PLMN_HUMAN
379 HGDGQSYRGTSSTTTTGKKCQSWSSMTPHRHQ-KTPENYPNAGLTMNYCRNPDAD-KGPW 436 gi|130316|sp|P00747|PLMN_HUMAN
|:|: ||| :|| || || |:: |||| ||| | ||| ||||||| | |||
483 FGNGKGYRGKRATTVTGTPCQDWAAQEPHRHSIFTPETNPRAGLEKNYCRNPDGDVGGPW 542 gi|130316|sp|P00747|PLMN_HUMAN
437 CFTTDPSVRWEYCNLKKCS 455 gi|130316|sp|P00747|PLMN_HUMAN
|:||:| ::||:: :|:
543 CYTTNPRKLYDYCDVPQCA 561 gi|130316|sp|P00747|PLMN_HUMAN
> 2 gi|130316|sp|P00747|PLMN_HUMAN len 810 from 102 to 352 vs gi|130316|sp|P00747|PLMN_HUMAN len 810 from 274 to 560 score 670 eval 1.676058e-61 identity 48.00% K 1.978546e-02 L 2.230037e-01 H 1.439763e+00 alpha 1.068759e-01
102 ECKTGNGKNYRGTMSKTKNGITCQKWSSTSPHRPRFSPATHPSEGLEENYCRNPDNDPQG 161 gi|130316|sp|P00747|PLMN_HUMAN
:| | |:|||| :: | :| ||| ||: :|| :| | : |:|||||||| :
274 QCLKGTGENYRGNVAVTVSGHTCQHWSAQTPHTHNRTPENFPCKNLDENYCRNPDG-KRA 332 gi|130316|sp|P00747|PLMN_HUMAN
162 PWCYTTDPEKRYDYCDILECE---------------------EECMHCSGENYDGKISKT 200 gi|130316|sp|P00747|PLMN_HUMAN
|||:||: : |::|| | |: ::| | |::| | | |
333 PWCHTTNSQVRWEYCKIPSCDSSPVSTEQLAPTAPPELTPVVQDCYHGDGQSYRGTSSTT 392 gi|130316|sp|P00747|PLMN_HUMAN
201 MSGLECQAWDSQSPHAHGYIPSKFPNKNLKKNYCRNPDRELRPWCFTTDPNKRWELCDIP 260 gi|130316|sp|P00747|PLMN_HUMAN
:| :||:| | :|| | | :|| | ||||||| : ||||||||: ||| |::
393 TTGKKCQSWSSMTPHRHQKTPENYPNAGLTMNYCRNPDADKGPWCFTTDPSVRWEYCNLK 452 gi|130316|sp|P00747|PLMN_HUMAN
261 RCT-------TPPP-------SSGPTYQCLKGTGENYRGNVAVTVSGHTCQHWSAQTPHT 306 gi|130316|sp|P00747|PLMN_HUMAN
:|: ||| : |: | |: ||| | ||:| || |:|| ||
453 KCSGTEASVVAPPPVVLLPDVETPSEEDCMFGNGKGYRGKRATTVTGTPCQDWAAQEPHR 512 gi|130316|sp|P00747|PLMN_HUMAN
307 HN-RTPENFPCKNLDENYCRNPDGK-RAPWCHTTNSQVRWEYCKIPSC 352 gi|130316|sp|P00747|PLMN_HUMAN
|: ||| | |::|||||||| |||:||| : ::|| :| |
513 HSIFTPETNPRAGLEKNYCRNPDGDVGGPWCYTTNPRKLYDYCDVPQC 560 gi|130316|sp|P00747|PLMN_HUMAN
> 3 gi|130316|sp|P00747|PLMN_HUMAN len 810 from 100 to 288 vs gi|130316|sp|P00747|PLMN_HUMAN len 810 from 374 to 587 score 467 eval 7.668240e-42 identity 47.34% K 1.978546e-02 L 2.230037e-01 H 1.439763e+00 alpha 1.068759e-01
100 LSECKTGNGKNYRGTMSKTKNGITCQKWSSTSPHRPRFSPATHPSEGLEENYCRNPDNDP 159 gi|130316|sp|P00747|PLMN_HUMAN
: :| |:|::|||| | | | || ||| :||| : :| :|: || ||||||| |
374 VQDCYHGDGQSYRGTSSTTTTGKKCQSWSSMTPHRHQKTPENYPNAGLTMNYCRNPDAD- 432 gi|130316|sp|P00747|PLMN_HUMAN
160 QGPWCYTTDPEKRYDYCDILEC-----------------------EEECMHCSGENYDGK 196 gi|130316|sp|P00747|PLMN_HUMAN
:||||:|||| |::||:: :| ||:|| :|: | ||
433 KGPWCFTTDPSVRWEYCNLKKCSGTEASVVAPPPVVLLPDVETPSEEDCMFGNGKGYRGK 492 gi|130316|sp|P00747|PLMN_HUMAN
197 ISKTMSGLECQAWDSQSPHAHG-YIPSKFPNKNLKKNYCRNPDREL-RPWCFTTDPNKRW 254 gi|130316|sp|P00747|PLMN_HUMAN
: |::| || | :| || | : | | |:|||||||| :: |||:||:| | :
493 RATTVTGTPCQDWAAQEPHRHSIFTPETNPRAGLEKNYCRNPDGDVGGPWCYTTNPRKLY 552 gi|130316|sp|P00747|PLMN_HUMAN
255 ELCDIPRCTTPPPSSG-PTYQCLKGTGENYRGNVA 288 gi|130316|sp|P00747|PLMN_HUMAN
: ||:|:| | | | : | | | ||
553 DYCDVPQCAAPSFDCGKPQVEPKKCPGRVVGGCVA 587 gi|130316|sp|P00747|PLMN_HUMAN
> 4 gi|130316|sp|P00747|PLMN_HUMAN len 810 from 102 to 196 vs gi|130316|sp|P00747|PLMN_HUMAN len 810 from 480 to 575 score 284 eval 4.056174e-24 identity 50.53% K 1.978546e-02 L 2.230037e-01 H 1.439763e+00 alpha 1.068759e-01
102 ECKTGNGKNYRGTMSKTKNGITCQKWSSTSPHRPR-FSPATHPSEGLEENYCRNPDNDPQ 160 gi|130316|sp|P00747|PLMN_HUMAN
:| |||| ||| : | | || |:: ||| |:| |:| |||:||||||| |
480 DCMFGNGKGYRGKRATTVTGTPCQDWAAQEPHRHSIFTPETNPRAGLEKNYCRNPDGDVG 539 gi|130316|sp|P00747|PLMN_HUMAN
161 GPWCYTTDPEKRYDYCDILECEEECMHCSGENYDGK 196 gi|130316|sp|P00747|PLMN_HUMAN
|||||||:| | |||||: :| | : |
540 GPWCYTTNPRKLYDYCDVPQCAAPSFDCGKPQVEPK 575 gi|130316|sp|P00747|PLMN_HUMAN
pangloss [85]%
Ariadne will only find the optimal alignment between each pair of sequences - ie it does a single-pass Smith-Waterman comparison. Prospero will find all suboptimal alignments. Ariadne will be modified in the next release to find all significant subalignments.
The formula for statistical significance has been thoroughly tested on standard substition matrices (ie the PAM and BLOSUM series). If you use it on other matrices/profiles and get unexpected results please contact me.