|
Complex Trait Consortium Meeting, Oxford 1-3 July 2003 Sequence Alignment with ARIADNE Sequence Alignment with monotonic gap penalties Aligning an EST to a genomic sequence
BIOSAPIENS Network of Excellence Integrated Genotyping System (IGS)   |
Sequence Alignment with Monotonic Gap PenaltiesThis page describes an algorithm to find Smith-Waterman style local alignments using monotonic gap penalties. See Mott, R (1999) Local sequence alignments with monotonic gap penalties Bioinformatics 6:455-462. Statistical significance is estimated using the methods used in the ariadne package. AbstractMOTIVATION: Sequence alignments obtained using affine gap penalties are not always biologically correct, because the insertion of long gaps is over-penalised. There is a need for an efficient algorithm which can find local alignments using non-linear gap penalties. RESULTS: A dynamic programming algorithm is described which computes optimal local sequence alignments for arbitrary, monotonically increasing gap penalties, i.e. where the cost g(k) of inserting a gap of k symbols is such that g(k) >= g(k-1). The running time of the algorithm is dependent on the scoring scheme; if the expected score of an alignment between random, unrelated sequences of lengths m, n is proportional to log mn, then with one exception, the algorithm has expected running time O(mn). Elsewhere, the running time is no greater than O(mn(m+n)). Optimisations are described which appear to reduce the worst-case run- time to O(mn) in many cases. We show how using a non-affine gap penalty can dramatically increase the probability of detecting a similarity containing a long gap. AVAILABILITY: The source code can be donwnloaded for non-commercial use. Code is (C) Richard Mott 1999-2000. MONOTONE Installation instructions:To compile the C source code for the monotone program, type
gcc -o monotone *.c -lm
This will create the binary ./monotone To get help type
zeon:/projects/rmott/MONOTONE> ./monotone -help
usage: ./monotone
-seq1 Readable File [ ]
-seq2 Readable File [ ]
-matrix text [ blosum62 ]
-gap_start float [ 9 ]
-gap_extend float [ 3 ]
-gap_power float [ 0.5 ]
-type text [ log ]
-align switch [ true ]
-self switch [ false ]
-shuffle integer [ 0 ]
-opt switch [ true ]
-seed integer [ 71166 ]
-len1 integer [ 300 ]
-len2 integer [ 300 ]
-help switch [ ]
-seq1 and -seq2 should be FASTA-format protein sequences -matrix is a file containing a BLAST-format substitution matrix. If the environment variable BLASTMAT is defined then the program will look there. The gap penalty function g(k), where k is the gap length, is defined by the -type switch. It must be one of
the gap penalty constants A, B, C are defined using the switches -gap_start (A) and -gap_extend (B) and -gap_popwer (C) -align is a boolean switch that controls whether the alignment is printed -self is a boolean switch that controls whether a sequence is compared against itself, if 'true' then only -seq1 need be specified. the parameters -shuffle, -opt, -len1, -len2 are only used for simulation studies, and are not relevant for comparisons between real sequences. ExampleConsider the alignment between the protein sequences P69892 (Human hemoglobin gamma-2 chain) and AAO61494 (hemoglobin alpha 4 subunit [Takifugu rubripes]).
./monotone -seq1 P69892.fa -seq2 AAO61494.fa -align -type log -matrix BLOSUM62 -gap_extend 7
matrix BLOSUM62 9.000 7.000 seq1 P69892 seq2 AAO61494
gap penalty log is convex
lambda(0) 0.3151 alpha 0.0799 lambda(alpha) 0.2835
matrix BLOSUM62 gap log start 9 extend 7 len1 147 len2 141 lambda 0.3151 s 0.4632 alpha 0.0799 theta1 0.8998 theta2 0.9027 K1 0.0598 K2 0.1077 L_GEM 25.92
clower 6.725366 cupper 7.314104
score = 196.881897
5 teedkatitslwgkv--nvedaggetlgrllvvypwtqrffdsfgnlssa 53 P69892
::::| : :| : || | ::| |: :| :: :| | ::|
3 sqkekrllaevwkslipaaedigsdsllrmftafpgsktyf-shldispr 52 AAO61494
54 saimgnpkvkahgkkvltslgdaikhlddlkgtfaqlselhcdklhvdpe 103 P69892
| | : :||:|:: :: : : : | | | || :| :||
53 s-----phmlshgrkivlaiaegaqdisqlavnlaplqilhayqlridpt 97 AAO61494
104 nfkllgnvlvtvlaihfgkeftpevqaswqkmvtgvasalssry 147 P69892
||||| : |: || | | :|||| |: | :: |: |: :|
98 nfkllthcllvalachlgddftpeahaaidkylsafaavlaeky 141 AAO61494
score 196.8819 pval 6.372499e-22
Drop me an email for more details. |
||||||||||||
|