Sequence Alignment with monotonic gap penalties
ARIADNE V1.3
ariadne is designed for database searches, and can operate in several modes:
- Sequences vs Sequences: Compare a library of sequence(s) against another
- Sequences vs Self: Compare a library of sequence(s) against itself
- Sequences vs Profiles: Compare a library of sequence(s) against a library of profile(s)
- Profiles vs Sequences: Compare a library of profile(s) against a library of sequence(s)
(the last two modes produce the same results but in different orders)
A "library" contains one or more sequences or profiles (see formats).
ariadne differs from most other database-search programs in that
statistical significance is determined on the fly, and statistically
significant alignments are printed out in the order the comparisons
were made, i.e. they are not sorted. A Perl script, ariadne_sort.pl, is
provided to sort the output in best-first order.
Running ariadne
usage: ariadne
-mode text [ ]
-seq Readable File [ ]
-seq2 Readable File [ ]
-profile Readable File [ ]
-matrix text [ ]
-A integer [ 11 ]
-B integer [ 1 ]
-ethresh float [ 0.1 ]
-dbsize float [ 1 ]
-align switch [ true ]
-all switch [ false ]
-help switch [ ]
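As an illustration of how these options combine, the following Python sketch assembles an argument list from the flags in the usage listing. It is not part of the ariadne distribution. The helper name `build_ariadne_args` is hypothetical, the table of required inputs per mode is inferred from the option notes (the entries for the profile modes are an assumption), and the file and matrix names are placeholders.

```python
# Sketch only: REQUIRED_INPUTS is inferred from the option notes; the
# profile-mode entries (-seq plus -profile) are an assumption.
REQUIRED_INPUTS = {
    "ss2self": ("seq", "matrix"),
    "s2ss": ("seq", "seq2", "matrix"),
    "ss2ss": ("seq", "seq2", "matrix"),
    "s2pp": ("seq", "profile"),
    "ss2pp": ("seq", "profile"),
    "p2ss": ("profile", "seq"),
    "pp2ss": ("profile", "seq"),
}


def build_ariadne_args(mode, dbsize, ethresh=0.1, gap_open=None,
                       gap_extend=None, **inputs):
    """Return an argument list for ariadne using only documented flags."""
    if mode not in REQUIRED_INPUTS:
        raise ValueError(f"unknown mode: {mode}")
    missing = [name for name in REQUIRED_INPUTS[mode] if name not in inputs]
    if missing:
        raise ValueError(f"mode {mode} needs: {', '.join(missing)}")

    args = ["ariadne", "-mode", mode]
    for name in REQUIRED_INPUTS[mode]:
        args += [f"-{name}", str(inputs[name])]
    # Gap penalties are passed only when given, so ariadne's defaults apply.
    if gap_open is not None:
        args += ["-A", str(gap_open)]
    if gap_extend is not None:
        args += ["-B", str(gap_extend)]
    # -dbsize is always passed explicitly, because its default of 1 is
    # rarely what a database search needs.
    args += ["-dbsize", str(dbsize), "-ethresh", str(ethresh)]
    return args


if __name__ == "__main__":
    # Placeholder inputs: a self-comparison of a 1000-sequence library.
    args = build_ariadne_args("ss2self", dbsize=499500,
                              seq="proteins.fasta", matrix="BLOSUM62")
    print(" ".join(args))
```

The resulting list could be handed to `subprocess.run`; the sketch stops at building the arguments so that it does not depend on ariadne being installed.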
- -mode defines the type of search. It must be one of the following:
- ss2self Compare a file of sequences against themselves, omitting duplicate and self-comparisons
- s2ss Compare one sequence against a library of sequences
- ss2ss Compare a library of sequences against another library
- s2pp Compare one sequence against a library of profiles
- p2ss Compare one profile against a library of sequences
- pp2ss Compare a library of profiles against a library of sequences
- ss2pp Compare a library of sequences against a library of profiles
It is always faster to compare profile(s) to sequence(s), because of
the overhead of reading the profiles, so use mode pp2ss in preference to ss2pp.
- -seq {file} The first FASTA file of sequences
- -seq2 {file} The second FASTA file of sequences (only required for modes s2ss or ss2ss). NOTE: in mode s2ss, -seq specifies the query sequence and -seq2 the database.
- -profile {file} The file of profiles
- -matrix {matrix} The name of the substitution matrix
Only required for modes ss2self, s2ss and ss2ss. ariadne uses the
BLAST matrix distribution, and looks for the matrix in
$BLASTMAT/{matrix}
- -A {gap-open penalty}
- -B {gap-extend penalty}
The gap penalties. NOTE: a gap of length k residues is scored as
A+B*k. This convention is different from FASTA's, which uses
A+B*(k-1). So, for example, the default FASTA gap penalties of 12, 2
correspond to A=10, B=2, because 12+2*(k-1) = 10+2*k.
- -dbsize {float}
The effective size, in SEQUENCES, of the database. You MUST set
this number, otherwise it defaults to a database size of 1
sequence. Because ariadne prints out matches as it finds them, it does
not know the database size ahead of time. Therefore you must provide
this information explicitly. Note that the database size differs from
the database length (used in BLAST).
- You can set the dbsize to an artificial value such as 100000 so that
you can compare the results of different searches more easily. This
corresponds to fixing the size of the protein universe. Because of the
redundancy (or near-redundancy) of many protein sequence databanks, it
is not necessarily correct to set dbsize to the total number of
sequences, because this is overly conservative. The "correct" value is
the number of independent comparisons made in the search.
- If you are comparing one library, size M, against another, size N,
you may want to set dbsize=M*N.
- If you are self-comparing a library of size N you may want to set
dbsize=N*(N-1)/2
- The database size is used to help determine the statistical
significance of matches (see below).
- -ethresh {float}
The e-value threshold is used to define the cutoff for printing out
alignments. The default value of ethresh is 0.1.
Suppose a comparison has a pairwise p-value P, taking into
account the sequence/profile lengths, composition and scoring
scheme. The e-value E is defined as
E = P*dbsize
and if E < ethresh the alignment is printed out.
WARNING: If you set ethresh to a high value (or set dbsize to a low value) then
the search will take considerably longer. This is because statistical
significance is assessed initially assuming a standard sequence
composition, so that pre-computed parameters are used. If the
estimated e-value is < 10*ethresh then statistical significance is
recomputed using parameters that reflect the
sequence/profile compositions accurately. This second computation is quite slow,
but is usually only triggered in a small proportion of cases, provided
the dbsize and ethresh are set appropriately.
If you want the pairwise p-values printed, set the dbsize to 1 and set
the ethresh to a small value such as 1.0e-8. You will then get all
similarities with p-values < 1.0e-8. This will still be fast.
- -[no]align
Turn the printing of alignments on or off. Alignments are printed by default; -noalign gives just one-line summaries.
- -[no]all
For comparisons where there is more than one local alignment with a significant e-value, print all of them rather than just the top-scoring one. Off by default. [New in V1.3]
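The numeric conventions described for -A/-B, -dbsize and -ethresh can be sketched in Python as follows. This is a minimal illustration of the arithmetic stated in the option notes, not ariadne's own code. All function names are hypothetical, and the p-values passed in are made-up inputs, since computing P itself is outside the scope of the sketch.

```python
def fasta_to_ariadne_gaps(fasta_open, fasta_extend):
    """Convert FASTA-style penalties, scored A+B*(k-1), to ariadne's A+B*k."""
    return fasta_open - fasta_extend, fasta_extend


def gap_cost(a, b, k):
    """Cost of a gap of k residues under ariadne's convention."""
    return a + b * k


def dbsize_two_libraries(m, n):
    """Suggested dbsize when comparing a library of size m against one of size n."""
    return m * n


def dbsize_self(n):
    """Suggested dbsize when self-comparing a library of size n."""
    return n * (n - 1) // 2


def evalue(p, dbsize):
    """E = P*dbsize, with P the pairwise p-value of one comparison."""
    return p * dbsize


def is_printed(p, dbsize, ethresh=0.1):
    """An alignment is printed when its e-value is below ethresh."""
    return evalue(p, dbsize) < ethresh


def needs_recompute(estimated_e, ethresh=0.1):
    """The slow composition-specific recomputation is triggered when the
    first-pass estimate falls below 10*ethresh."""
    return estimated_e < 10 * ethresh


if __name__ == "__main__":
    print(fasta_to_ariadne_gaps(12, 2))  # FASTA default 12, 2 -> (10, 2)
    print(dbsize_self(1000))             # 499500
    print(is_printed(1e-8, 100000))      # True
```

The last helper shows why a high ethresh or a low dbsize slows the search: both make more first-pass estimates fall under 10*ethresh, so more comparisons go through the slow recomputation.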
Filtering Output
Apart from the built-in threshold -ethresh, ariadne and prospero do not provide any other way to filter output. However, the Perl script prospero.pl will filter output based on alignment length, score, e-value and percent identity. Run the script like this:
prospero -seq1 gi130316.pep | prospero.pl -minscore 600
Use the command-line switches -minscore, -minidentity, -minlen, -maxeval to control the output.
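When filters are chosen programmatically, the pipeline above can be assembled with a small Python sketch such as the one below. It uses only the four switches named above. The helper names are hypothetical, and the units and value formats that prospero.pl expects for each switch are not specified here, so values are passed through unchanged.

```python
import shlex

# The filter switches documented for prospero.pl, in a fixed output order.
FILTER_SWITCHES = ("minscore", "minidentity", "minlen", "maxeval")


def build_filter_args(**limits):
    """Return an argument list for prospero.pl from the requested limits."""
    unknown = sorted(set(limits) - set(FILTER_SWITCHES))
    if unknown:
        raise ValueError(f"unsupported switches: {', '.join(unknown)}")
    args = ["prospero.pl"]
    for name in FILTER_SWITCHES:
        if name in limits:
            args += [f"-{name}", str(limits[name])]
    return args


def pipeline(search_args, filter_args):
    """Join a search command and a filter command into one shell pipeline."""
    return shlex.join(search_args) + " | " + shlex.join(filter_args)


if __name__ == "__main__":
    search = ["prospero", "-seq1", "gi130316.pep"]
    print(pipeline(search, build_filter_args(minscore=600)))
```

Run as a script, this prints the same command line as the example above.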
Please send questions, comments and bug reports to Richard Mott.