Wellcome Logo The Wellcome Trust Centre for Human GeneticsOU Crest
University of Oxford 

ARIADNE V1.2 README

Copyright Richard Mott, 2000,
University of Oxford,
Wellcome Trust Centre for Human Genetics,
Roosevelt Drive OX3 7BN UK

Richard.Mott@well.ox.ac.uk


Table of Contents

introduction
ariadne
prospero
file formats
downloading and installing
running ariadne
output
statistical model
running prospero
known bugs and limitations

Introduction

ARIADNE is a package of two programs, ariadne and prospero, that compare protein sequences and profiles using the Smith-Waterman algorithm, and assesses statistical significance using a new accurate formula, described in Mott, 2000, "Accurate Formula for P-values of gapped local sequence and profile alignments" J. Mol Biol. 300:649-659 [link to JMB] .

The package is written in ANSI C.

The sequence/profile comparison algorithms used in this code are fairly standard, and are probably not the fastest implementations available. The novel part is the method for determining statistical significance, which will give thresholds of significance that are accurate to within 5% 95% of the time. See Mott (2000) and Mott and Tribe (1999) for more details.

You are free to incorporate the method used for assessing statistical significance into third-party code, provided you cite the above reference. The routines for assessing significance are all in gaplib.c


ariadne

is designed for database searches, and can operate in several modes:

(the last two modes produce the same results but in different orders)

A "library" contains one or more sequences or profiles (see formats).

ariadne differs from most other database-search programs in that statistical significance is determined on the fly, and statistically significant alignments are printed out in the order the comparisons were made, ie they are not sorted. A Perl script ariadne_sort.pl is provided to sort the output in best-first order.


prospero

will compare a sequence to itself, another sequence or a profile, and print all local alignments with p-values less than some user-defined threshold. Thus prospero is ideal for the analysis of repeats within a sequence.


Input File Formats

Substitution matrices (required for sequence to sequence comparisons) must be in BLAST format. [ARIADNE will look for matrices in the directory pointed to by the environment variable BLASTMAT if it is defined.]

Sequence libraries must be in FASTA format. e.g

>NP_005359
mglsdgewqlvlnvwgkveadipghgqevlirlfkghpetlekfdkfkhlksedemkase
dlkkhgatvltalggilkkkghheaeikplaqshatkhkipvkylefiseciiqvlqskh
pgdfgadaqgamnkalelfrkdmasnykelgfqg
Format for profiles:

A library of profiles can contain multiple profiles. Format for each profile is a 3-line header followed by the profile data. Blank lines are ignored. The header looks like

>name             		name of the profile, eg "globin"
profile-length alphabet-length  eg "141 21"
alphabet-order                  List of space-separated amino-acids in the order they occur in the profile
The profile contains a series of lines, one per profile position, with format

[position] [consensus] [profile-scores in order defined by alphabet-order]

For example, part of a profile looks like this:

>globin
141 21
A R N D C Q E G H I L K M F P S T W Y V X
    1 H   -4  -1   2  -2  -5  -1  -1  -3  10  -5  -5  -2  -4  -3  -4  -2  -3  -4   0  -5 -1
    2 L   -4  -5  -6  -7   0  -5  -6  -6  -5  -1   6  -5   0   0  -6  -5  -4   6  -3  -2 -1
    3 S   -1  -4  -1  -1  -4  -2  -2  -2  -4  -4  -4  -3  -2  -5  -4   5   6  -5  -5  -3 -1
    4 A    4  -3   0   3  -1   0   2  -1  -2  -3  -3  -1  -2  -3   0   0  -2  -6  -3  -3 -1
    5 E    2  -1  -1   3  -3   0   3  -1   0  -5  -3   1  -4  -3  -1  -1  -1  -6  -4  -3 -1
    6 E   -3  -3   0   4  -6   4   5  -4  -3  -3  -5  -2  -4  -6  -4  -2  -1  -6  -5  -3 -1
    7 K    0   3  -2  -4  -2  -1   1  -4  -1   0  -3   5  -2  -1  -4  -2  -3   4  -4  -1 -1
    8 A    3  -1   0   0  -2   2   0  -2   1  -2  -2   1  -3  -2  -4   1   1  -5  -4  -3 -1
    9 L    2  -2   2  -2  -2  -2  -1  -4   0   1   3   0  -1  -2  -5  -2   0  -5  -2   0 -1
   10 V   -2  -6  -6  -6  -4  -5  -5  -6  -6   5   1  -5  -1  -2  -5  -4  -3  -6  -4   6 -1
   11 K   -1   3   1  -2  -3   2  -2  -3   0  -3   0   5   0  -5  -4  -1   1  -5  -4  -2 -1
   12 A    2  -1   0   1  -3   0   0   0   1  -4  -3   1  -2  -4  -4   3   0  -6  -4  -3 -1
   13 L    0  -4  -2  -3   0   0  -2  -4  -1   1   1  -3  -1  -1  -4   3   2  -5  -3   1 -1
   14 W   -3  -6  -6  -5  -2  -5  -4  -4  -5  -2  -4  -6  -2   2  -7  -4  -4  13   0  -2 -1
........... etc etc
So eg at position 2 the consensus is L and the score for matching an A at this position is -4

Profiles and substitution matrices should be such that the smallest span of score values is 1 Also, for reasons of efficiency it is best is the total range of score values is quite small, say < 30

Some example profile sets derived from PFAM-A are provided for download.


Downloading and Installing ARIADNE

A compressed tarred source distribution is available for download here.

Alternatively, you can download precompiled binaries for Linux, SGI IRIX64, SunOS, OSF.

To compile the sources, after untarring the distribution ariadne-1.2.tar.Z type the following commands

% cd ./SRC-1.2/
% make
The binaries will be in the directory ./SRC-1.2/$UNAME where $UNAME is the machine architecture returned by the uname command. Either put this directory on your path or copy the binaries to a directory in your path.

NB You will need a version of make compatible with gnu make. If you don't have this, then you can compile the programs with the following commands executed in the source directory:

cc -o ./ariadne -O -I./Include cl.c gaplib.c profile_util.c prospero_utils.c searchpath.c cmp.c profile_stat.c ariadne.c readline.c seq_util.c -lm 

cc -o ./prospero -O -I./Include cl.c gaplib.c profile_util.c prospero_utils.c searchpath.c cmp.c profile_stat.c prospero.c readline.c seq_util.c -lm 
This will leave the executables in the source directory; they will need to be moved elsewhere by hand.


Running ariadne

usage: ariadne
        -mode           text                 [  ]
        -seq            Readable File        [  ]
        -seq2           Readable File        [  ]
        -profile        Readable File        [  ]
        -matrix         text                 [  ]
        -A              integer              [ 11 ]
        -B              integer              [ 1 ]
        -ethresh        float                [ 0.1 ]
        -dbsize         float                [ 1 ]
        -help           switch               [  ]      


Output

ariadne prints out statistically significant alignments in the order that it finds them. Each alignment is preceeded by a one-line summary, e.g.
> 0 d1bal__ len 51 from 14 to 49  vs  d2pdd__ len 43 from 5 to 40   score 77  eval 6.954146e-10 identity 36.11% K 1.361261e-01 L 3.478619e-01 H 1.187679e+00 alpha 8.223475e-02

   14 PAIRRLLAEHNLDASAIKGTGVGGRLTREDVEKHLA    49  d1bal__
      |::|:   |  :|   ::|||  ||: :||::  ||
    5 PSVRKYAREKGVDIRLVQGTGKNGRVLKEDIDAFLA    40  d2pdd__

The one-line summary says that the comparison between d1bal__ (51 residues) and d2pdd__ (43 residues) scored 77 with an e-value of 6.954e-10 and a percentage identity 36%. The parameters used for assessing significance, K, L, H, alpha took the values 0.136, 0.347, 1.187, 0.082. [H, alpha are ony printed for information].

Profile-Sequence alignments look the same as Sequence-Sequence alignments, with the profile consensus printed.


Statistical Model

ariadne and prospero test the statistical significance of similarity scores using the following model ( download this for more details):

A score S in a comparison between two sequences or a sequence and a profile has pairwise p-value P given by

P = 1-exp(-Kmn exp(-L*S))

where m,n are the sequence lengths and K, L are parameters depending on the compositions, scoring scheme and (slightly) on the sequence lengths. K, L are calculated using the formula described in Mott, 2000, which takes account of sequence composition, substitution matrix, gap penalty, sequence length:

L = L_u*(1.013 -2.61*alpha + f(m,n)( -0.76 + 9.34*alpha +1.12/H) )

K = K_u*exp( 0.26 -18.92*alpha + f(m,n)(-1.76 + 32.69*alpha + 192.52*alpha^2 + 3.24/H ) )

where: f(m,n) = log(m*n)*(1/m+1/n) L_u, K_u are the parameters for ungapped alignments H is the entropy of ungapped alignments (eg as defined Karlin-Altschul PNAS 1990, or Mott and Tribe 1999), and alpha is a parameter depending on the gap penalty A+B*k:

alpha = 2*s*exp(-L_u(A+B))/(1-exp(-L_u*B))

where

s = sqrt{ (K_u/H) * [ delta / exp(L_u *delta ) ] }

and delta is the smallest span of score values (usually 1)

In the example above, K = 1.361261e-01 L = 3.478619e-01, m = 51, n = 43, S = 125, so

P = 1-exp(-0.1361*51*43*exp(-0.3478*125)) = 5.95e-16. The database size was set at 100000 = 1.0e6 sequences, so the evalue is 5.95e-10.


Running prospero

Most of the command-line switches are the same as for ariadne:
usage: prospero
	-seq1           Readable File        [  ]
	-seq2           Readable File        [  ]
	-pro            Readable File        [  ]
	-matrix         text                 [ blosum62 ]
	-A              integer              [ 11 ]
	-B              integer              [ 1 ]
	-top            integer              [ 5 ]
	-pseudocount    integer              [ 100 ]
	-ethresh        float                [ 0 ]
	-help           switch               [  ]

Output from prospero is very similar to ariadne's. The E-values reported are in fact P-values, as the database size is set to 1. The alignments are sorted by evalue. Example:


pangloss [84]% ./SRC-1.2/Linux/prospero -seq1 gi130316.pep -ethresh 0.001
***** PROSPERO V1.0  Sun May 14 13:25:04 2000 *****

Copyright Richard Mott, 2000, Wellcome Trust Centre for Human Genetics, University of Oxford

using gap penalty 11+1k
using matrix blosum62
printing all alignments with eval < 0.001000
using sequence1 gi|130316|sp|P00747|PLMN_HUMAN
using self-comparison

> 1 gi|130316|sp|P00747|PLMN_HUMAN len 810 from 102 to 455  vs  gi|130316|sp|P00747|PLMN_HUMAN len 810 from 184 to 561   score 994  eval 7.000317e-93 identity 49.58% K 1.978546e-02 L 2.230037e-01 H 1.439763e+00 alpha 1.068759e-01

  102 ECKTGNGKNYRGTMSKTKNGITCQKWSSTSPHRPRFSPATHPSEGLEENYCRNPDNDPQG   161  gi|130316|sp|P00747|PLMN_HUMAN
      ||   :|:|| | :||| :|: || | | |||   : |:  |:: |::||||||| : : 
  184 ECMHCSGENYDGKISKTMSGLECQAWDSQSPHAHGYIPSKFPNKNLKKNYCRNPDRELR-   242  gi|130316|sp|P00747|PLMN_HUMAN

  162 PWCYTTDPEKRYDYCDILECEE---------ECMHCSGENYDGKISKTMSGLECQAWDSQ   212  gi|130316|sp|P00747|PLMN_HUMAN
      |||:|||| ||:: |||  |           :|:  :|||| | :: |:||  || | :|
  243 PWCFTTDPNKRWELCDIPRCTTPPPSSGPTYQCLKGTGENYRGNVAVTVSGHTCQHWSAQ   302  gi|130316|sp|P00747|PLMN_HUMAN

  213 SPHAHGYIPSKFPNKNLKKNYCRNPDRELRPWCFTTDPNKRWELCDIPRC----------   262  gi|130316|sp|P00747|PLMN_HUMAN
      :|| |   |  || ||| :||||||| :  ||| ||:   ||| | || |          
  303 TPHTHNRTPENFPCKNLDENYCRNPDGKRAPWCHTTNSQVRWEYCKIPSCDSSPVSTEQL   362  gi|130316|sp|P00747|PLMN_HUMAN

  263 -TTPPPSSGPTYQ-CLKGTGENYRGNVAVTVSGHTCQHWSAQTPHTHNRTPENFPCKNLD   320  gi|130316|sp|P00747|PLMN_HUMAN
        | ||   |  | |  | |::|||  : | :|  || ||: ||| | :||||:|   | 
  363 APTAPPELTPVVQDCYHGDGQSYRGTSSTTTTGKKCQSWSSMTPHRHQKTPENYPNAGLT   422  gi|130316|sp|P00747|PLMN_HUMAN

  321 ENYCRNPDGKRAPWCHTTNSQVRWEYCKIPSCDSSPVSTEQLAPTA--PPELTPVVQDCY   378  gi|130316|sp|P00747|PLMN_HUMAN
       |||||||  : ||| ||:  |||||| :  |  :  |     |    |   ||  :|| 
  423 MNYCRNPDADKGPWCFTTDPSVRWEYCNLKKCSGTEASVVAPPPVVLLPDVETPSEEDCM   482  gi|130316|sp|P00747|PLMN_HUMAN

  379 HGDGQSYRGTSSTTTTGKKCQSWSSMTPHRHQ-KTPENYPNAGLTMNYCRNPDAD-KGPW   436  gi|130316|sp|P00747|PLMN_HUMAN
       |:|: |||  :|| ||  || |::  ||||   |||  | |||  ||||||| |  |||
  483 FGNGKGYRGKRATTVTGTPCQDWAAQEPHRHSIFTPETNPRAGLEKNYCRNPDGDVGGPW   542  gi|130316|sp|P00747|PLMN_HUMAN

  437 CFTTDPSVRWEYCNLKKCS   455  gi|130316|sp|P00747|PLMN_HUMAN
      |:||:|   ::||:: :|:
  543 CYTTNPRKLYDYCDVPQCA   561  gi|130316|sp|P00747|PLMN_HUMAN

> 2 gi|130316|sp|P00747|PLMN_HUMAN len 810 from 102 to 352  vs  gi|130316|sp|P00747|PLMN_HUMAN len 810 from 274 to 560   score 670  eval 1.676058e-61 identity 48.00% K 1.978546e-02 L 2.230037e-01 H 1.439763e+00 alpha 1.068759e-01

  102 ECKTGNGKNYRGTMSKTKNGITCQKWSSTSPHRPRFSPATHPSEGLEENYCRNPDNDPQG   161  gi|130316|sp|P00747|PLMN_HUMAN
      :|  | |:|||| :: | :| ||| ||: :||    :|   | : |:||||||||   : 
  274 QCLKGTGENYRGNVAVTVSGHTCQHWSAQTPHTHNRTPENFPCKNLDENYCRNPDG-KRA   332  gi|130316|sp|P00747|PLMN_HUMAN

  162 PWCYTTDPEKRYDYCDILECE---------------------EECMHCSGENYDGKISKT   200  gi|130316|sp|P00747|PLMN_HUMAN
      |||:||: : |::|| |  |:                     ::| |  |::| |  | |
  333 PWCHTTNSQVRWEYCKIPSCDSSPVSTEQLAPTAPPELTPVVQDCYHGDGQSYRGTSSTT   392  gi|130316|sp|P00747|PLMN_HUMAN

  201 MSGLECQAWDSQSPHAHGYIPSKFPNKNLKKNYCRNPDRELRPWCFTTDPNKRWELCDIP   260  gi|130316|sp|P00747|PLMN_HUMAN
       :| :||:| | :|| |   |  :||  |  ||||||| :  ||||||||: ||| |:: 
  393 TTGKKCQSWSSMTPHRHQKTPENYPNAGLTMNYCRNPDADKGPWCFTTDPSVRWEYCNLK   452  gi|130316|sp|P00747|PLMN_HUMAN

  261 RCT-------TPPP-------SSGPTYQCLKGTGENYRGNVAVTVSGHTCQHWSAQTPHT   306  gi|130316|sp|P00747|PLMN_HUMAN
      :|:        |||        :     |: | |: |||  | ||:|  || |:|| || 
  453 KCSGTEASVVAPPPVVLLPDVETPSEEDCMFGNGKGYRGKRATTVTGTPCQDWAAQEPHR   512  gi|130316|sp|P00747|PLMN_HUMAN

  307 HN-RTPENFPCKNLDENYCRNPDGK-RAPWCHTTNSQVRWEYCKIPSC   352  gi|130316|sp|P00747|PLMN_HUMAN
      |:  |||  |   |::||||||||    |||:||| :  ::|| :| |
  513 HSIFTPETNPRAGLEKNYCRNPDGDVGGPWCYTTNPRKLYDYCDVPQC   560  gi|130316|sp|P00747|PLMN_HUMAN

> 3 gi|130316|sp|P00747|PLMN_HUMAN len 810 from 100 to 288  vs  gi|130316|sp|P00747|PLMN_HUMAN len 810 from 374 to 587   score 467  eval 7.668240e-42 identity 47.34% K 1.978546e-02 L 2.230037e-01 H 1.439763e+00 alpha 1.068759e-01

  100 LSECKTGNGKNYRGTMSKTKNGITCQKWSSTSPHRPRFSPATHPSEGLEENYCRNPDNDP   159  gi|130316|sp|P00747|PLMN_HUMAN
      : :|  |:|::|||| | |  |  || ||| :||| : :|  :|: ||  ||||||| | 
  374 VQDCYHGDGQSYRGTSSTTTTGKKCQSWSSMTPHRHQKTPENYPNAGLTMNYCRNPDAD-   432  gi|130316|sp|P00747|PLMN_HUMAN

  160 QGPWCYTTDPEKRYDYCDILEC-----------------------EEECMHCSGENYDGK   196  gi|130316|sp|P00747|PLMN_HUMAN
      :||||:||||  |::||:: :|                       ||:||  :|: | ||
  433 KGPWCFTTDPSVRWEYCNLKKCSGTEASVVAPPPVVLLPDVETPSEEDCMFGNGKGYRGK   492  gi|130316|sp|P00747|PLMN_HUMAN

  197 ISKTMSGLECQAWDSQSPHAHG-YIPSKFPNKNLKKNYCRNPDREL-RPWCFTTDPNKRW   254  gi|130316|sp|P00747|PLMN_HUMAN
       : |::|  || | :| || |  : |   |   |:|||||||| ::  |||:||:| | :
  493 RATTVTGTPCQDWAAQEPHRHSIFTPETNPRAGLEKNYCRNPDGDVGGPWCYTTNPRKLY   552  gi|130316|sp|P00747|PLMN_HUMAN

  255 ELCDIPRCTTPPPSSG-PTYQCLKGTGENYRGNVA   288  gi|130316|sp|P00747|PLMN_HUMAN
      : ||:|:|  |    | |  :  |  |    | ||
  553 DYCDVPQCAAPSFDCGKPQVEPKKCPGRVVGGCVA   587  gi|130316|sp|P00747|PLMN_HUMAN

> 4 gi|130316|sp|P00747|PLMN_HUMAN len 810 from 102 to 196  vs  gi|130316|sp|P00747|PLMN_HUMAN len 810 from 480 to 575   score 284  eval 4.056174e-24 identity 50.53% K 1.978546e-02 L 2.230037e-01 H 1.439763e+00 alpha 1.068759e-01

  102 ECKTGNGKNYRGTMSKTKNGITCQKWSSTSPHRPR-FSPATHPSEGLEENYCRNPDNDPQ   160  gi|130316|sp|P00747|PLMN_HUMAN
      :|  |||| |||  : |  |  || |::  |||   |:| |:|  |||:||||||| |  
  480 DCMFGNGKGYRGKRATTVTGTPCQDWAAQEPHRHSIFTPETNPRAGLEKNYCRNPDGDVG   539  gi|130316|sp|P00747|PLMN_HUMAN

  161 GPWCYTTDPEKRYDYCDILECEEECMHCSGENYDGK   196  gi|130316|sp|P00747|PLMN_HUMAN
      |||||||:| | |||||: :|      |     : |
  540 GPWCYTTNPRKLYDYCDVPQCAAPSFDCGKPQVEPK   575  gi|130316|sp|P00747|PLMN_HUMAN

pangloss [85]% 

Known Bugs and limitations

Currently the smallest span of score values in the substitution matrix or profile must be 1. (Will be fixed shortly)

Ariadne will only find the optimal alignment between each pair of sequences - ie it does a single-pass Smith-Waterman comparison. Prospero will find all suboptimal alignments. Ariadne will be modified in the next release to find all significant subalignments.

The formula for statistical significance has been thoroughly tested on standard substition matrices (ie the PAM and BLOSUM series). If you use it on other matrices/profiles and get unexpected results please contact me.


Please send Questions, Comments, and Bug Reports to Richard Mott