Sequence Alignment with monotonic gap penalties




ARIADNE V1.3

ariadne is designed for database searches, and can operate in several modes:

  • Sequences vs Sequences: Compare a library of sequence(s) against another
  • Sequences vs Self: Compare a library of sequence(s) against itself
  • Sequences vs Profiles: Compare a library of sequence(s) against a library of profile(s)
  • Profiles vs Sequences: Compare a library of profile(s) against a library of sequence(s)

(the last two modes produce the same results but in different orders)

A "library" contains one or more sequences or profiles (see formats).

ariadne differs from most other database-search programs in that statistical significance is determined on the fly, and statistically significant alignments are printed in the order the comparisons were made, i.e. they are not sorted. A Perl script, ariadne_sort.pl, is provided to sort the output in best-first order.

Running ariadne

  usage: ariadne
        -mode           text                 [  ]
        -seq            Readable File        [  ]
        -seq2           Readable File        [  ]
        -profile        Readable File        [  ]
        -matrix         text                 [  ]
        -A              integer              [ 11 ]
        -B              integer              [ 1 ]
        -ethresh        float                [ 0.1 ]
        -dbsize         float                [ 1 ]
        -align          switch               [ true ]
        -all            switch               [ false ]
        -help           switch               [  ]

  • -mode defines the type of search. It must be one of the following:
    • ss2self Compare a file of sequences against themselves, omitting duplicate comparisons and self-comparisons
    • s2ss Compare one sequence against a library of sequences
    • ss2ss Compare a library of sequences against another library
    • s2pp Compare one sequence against a library of profiles
    • p2ss Compare one profile against a library of sequences
    • pp2ss Compare a library of profiles against a library of sequences
    • ss2pp Compare a library of sequences against a library of profiles
    It is always faster to compare profile(s) to sequence(s), because of the overhead of reading the profiles, so use mode pp2ss in preference to ss2pp.
  • -seq {file} The first FASTA file of sequences
  • -seq2 {file} The second FASTA file of sequences (only required for modes s2ss or ss2ss). NOTE: in mode s2ss, -seq specifies the query sequence and -seq2 the database.
  • -profile {file} The file of profiles
  • -matrix {matrix} The name of the substitution matrix. Only required for modes ss2self, s2ss and ss2ss. ariadne uses the BLAST matrix distribution, and looks for the matrix in $BLASTMAT/{matrix}.
  • -A {gap-open penalty}
  • -B {gap-extend penalty}
  • Together, -A and -B define the gap penalties. NOTE: a gap of length k residues is scored as A+B*k. This convention differs from FASTA's, which uses A+B*(k-1). For example, the default FASTA gap penalty of 12, 2 corresponds to A=10, B=2, since 12+2*(k-1) = 10+2*k.
  • -dbsize {integer} The effective size, in SEQUENCES, of the database. You MUST set this number, otherwise it defaults to a database size of 1 sequence. Because ariadne prints out matches as it finds them, it does not know the database size ahead of time. Therefore you must provide this information explicitly. Note that the database size differs from the database length (used in BLAST).
    • You can set the dbsize to an artificial value such as 100000 so that you can compare the results of different searches more easily. This corresponds to fixing the size of the protein universe. Because of the redundancy (or near-redundancy) of many protein sequence databanks, it is not necessarily correct to set dbsize to the total number of sequences, because this is overly conservative. The "correct" value is the number of independent comparisons made in the search.
    • If you are comparing one library, size M, against another, size N, you may want to set dbsize=M*N.
    • If you are self-comparing a library of size N you may want to set dbsize=N*(N-1)/2
    • The database size is used to help determine the statistical significance of matches (see below).
  • -ethresh {float} The e-value threshold defines the cutoff for printing alignments. The default value of ethresh is 0.1. Suppose a comparison has a pairwise p-value P, taking account of the sequence/profile lengths, composition and scoring scheme. The e-value E is defined as
    E = P*dbsize
    and if E < ethresh the alignment is printed out. WARNING: If you set ethresh to a high value (or set dbsize to a low value) then the search will take considerably longer. This is because statistical significance is assessed initially assuming a standard sequence composition, so that pre-computed parameters are used. If the estimated e-value is < 10*ethresh then statistical significance is recomputed using parameters that reflect the sequence/profile compositions accurately. This second computation is quite slow, but is usually only triggered in a small proportion of cases, provided the dbsize and ethresh are set appropriately.
    If you want the pairwise p-values printed, set the dbsize to 1 and set the ethresh to a small value such as 1.0e-8. You will then get all similarities with p-values < 1.0e-8. This will still be fast.
  • -[no]align Alignments are printed by default; use -noalign to turn off the printing of alignments (this gives just one-line summaries)
  • -[no]all For comparisons where there is more than one local alignment with a significant e-value, print all of them rather than just the top-scoring one [New in V1.3]
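
The arithmetic behind the -A, -B, -dbsize and -ethresh options described above can be sketched in a few lines of Python. This is only an illustration of the formulas stated in the option list, not part of ariadne itself; all function names are hypothetical and chosen for this example.

```python
# Illustrative helpers for the conventions described in the option list.
# None of these functions exist in ariadne; the names are assumptions.


def ariadne_gap_cost(a, b, k):
    """Cost of a gap of k residues under ariadne's convention: A + B*k."""
    return a + b * k


def fasta_gap_cost(gap_open, gap_extend, k):
    """Cost of a gap of k residues under FASTA's convention: A + B*(k-1)."""
    return gap_open + gap_extend * (k - 1)


def fasta_to_ariadne(gap_open, gap_extend):
    """Convert a FASTA (open, extend) pair to the equivalent ariadne (A, B).

    open + extend*(k-1) == (open - extend) + extend*k for every k.
    """
    return gap_open - gap_extend, gap_extend


def dbsize_cross(m, n):
    """Suggested dbsize when comparing a library of size M against one of size N."""
    return m * n


def dbsize_self(n):
    """Suggested dbsize when self-comparing a library of size N."""
    return n * (n - 1) // 2


def evalue(p, dbsize):
    """E = P * dbsize, where P is the pairwise p-value."""
    return p * dbsize


def is_printed(p, dbsize, ethresh=0.1):
    """An alignment is printed when E < ethresh (default ethresh is 0.1)."""
    return evalue(p, dbsize) < ethresh


def triggers_recomputation(estimated_e, ethresh=0.1):
    """The slow, composition-specific recomputation is triggered when the
    initial estimate of the e-value is below 10*ethresh."""
    return estimated_e < 10 * ethresh


a, b = fasta_to_ariadne(12, 2)
print(a, b)                # prints "10 2", matching the FASTA example above
print(dbsize_self(1000))   # prints "499500"
```

Raising ethresh, or lowering dbsize, makes triggers_recomputation true for more comparisons, which is why such settings slow the search down.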

Filtering Output

Apart from the built-in threshold -ethresh, ariadne and prospero do not provide any other way to filter output. However, the Perl script prospero.pl will filter output based on alignment length, score, e-value and percent identity. Run the script like this:


    prospero -seq1 gi130316.pep | prospero.pl -minscore 600 

Use the command-line switches -minscore, -minidentity, -minlen, -maxeval to control the output.


Please send Questions, Comments, and Bug Reports to Richard Mott
