Sequence Alignment with Monotonic Gap Penalties
This page describes an algorithm to find Smith-Waterman style local alignments using monotonic gap penalties.
Statistical significance is estimated using the methods used in the ariadne package.
MOTIVATION: Sequence alignments obtained using affine gap penalties are not always biologically correct, because the insertion of long gaps is over-penalised. There is a need for an efficient algorithm which can find local alignments using non-linear gap penalties.
RESULTS: A dynamic programming algorithm is described which computes optimal local sequence alignments for arbitrary, monotonically increasing gap penalties, i.e. where the cost g(k) of inserting a gap of k symbols is such that g(k) >= g(k-1). The running time of the algorithm is dependent on the scoring scheme; if the expected score of an alignment between random, unrelated sequences of lengths m, n is proportional to log mn, then with one exception, the algorithm has expected running time O(mn). Elsewhere, the running time is no greater than O(mn(m+n)). Optimisations are described which appear to reduce the worst-case run- time to O(mn) in many cases. We show how using a non-affine gap penalty can dramatically increase the probability of detecting a similarity containing a long gap.
AVAILABILITY: The source code can be donwnloaded for non-commercial use. Code is (C) Richard Mott 1999-2000.
MONOTONE Installation instructions:
To compile the C source code for the monotone program, type
gcc -o monotone *.c -lm
This will create the binary ./monotone To get help type
zeon:/projects/rmott/MONOTONE> ./monotone -help usage: ./monotone -seq1 Readable File [ ] -seq2 Readable File [ ] -matrix text [ blosum62 ] -gap_start float [ 9 ] -gap_extend float [ 3 ] -gap_power float [ 0.5 ] -type text [ log ] -align switch [ true ] -self switch [ false ] -shuffle integer [ 0 ] -opt switch [ true ] -seed integer [ 71166 ] -len1 integer [ 300 ] -len2 integer [ 300 ] -help switch [ ]
-seq1 and -seq2 should be FASTA-format protein sequences
-matrix is a file containing a BLAST-format substitution matrix. If the environment variable BLASTMAT is defined then the program will look there.
The gap penalty function g(k), where k is the gap length, is defined by the -type switch. It must be one of
the gap penalty constants A, B, C are defined using the switches -gap_start (A) and -gap_extend (B) and -gap_popwer (C)
-align is a boolean switch that controls whether the alignment is printed
-self is a boolean switch that controls whether a sequence is compared against itself, if 'true' then only -seq1 need be specified.
the parameters -shuffle, -opt, -len1, -len2 are only used for simulation studies, and are not relevant for comparisons between real sequences.
Consider the alignment between the protein sequences P69892 (Human hemoglobin gamma-2 chain) and AAO61494 (hemoglobin alpha 4 subunit [Takifugu rubripes]).
./monotone -seq1 P69892.fa -seq2 AAO61494.fa -align -type log -matrix BLOSUM62 -gap_extend 7 matrix BLOSUM62 9.000 7.000 seq1 P69892 seq2 AAO61494 gap penalty log is convex lambda(0) 0.3151 alpha 0.0799 lambda(alpha) 0.2835 matrix BLOSUM62 gap log start 9 extend 7 len1 147 len2 141 lambda 0.3151 s 0.4632 alpha 0.0799 theta1 0.8998 theta2 0.9027 K1 0.0598 K2 0.1077 L_GEM 25.92 clower 6.725366 cupper 7.314104 score = 196.881897 5 teedkatitslwgkv--nvedaggetlgrllvvypwtqrffdsfgnlssa 53 P69892 ::::| : :| : || | ::| |: :| :: :| | ::| 3 sqkekrllaevwkslipaaedigsdsllrmftafpgsktyf-shldispr 52 AAO61494 54 saimgnpkvkahgkkvltslgdaikhlddlkgtfaqlselhcdklhvdpe 103 P69892 | | : :||:|:: :: : : : | | | || :| :|| 53 s-----phmlshgrkivlaiaegaqdisqlavnlaplqilhayqlridpt 97 AAO61494 104 nfkllgnvlvtvlaihfgkeftpevqaswqkmvtgvasalssry 147 P69892 ||||| : |: || | | :|||| |: | :: |: |: :| 98 nfkllthcllvalachlgddftpeahaaidkylsafaavlaeky 141 AAO61494 score 196.8819 pval 6.372499e-22
Drop me an email for more details.