Bioinformatics Group

Richard Mott's Home Page


Mouse Resources

HS QTL Project

QTL mapping with HAPPY

Mouse Haplotype Structure

Complex Trait Consortium Meeting, Oxford 1-3 July 2003

SNP Selection Methods


Domain Localisation

DNA-Protein Binding


Sequence Alignment with ARIADNE

Sequence Alignment with monotonic gap penalties

Aligning an EST to a genomic sequence


BIOSAPIENS Network of Excellence

Integrated Genotyping System (IGS)

 

Wellcome Trust Centre for Human Genetics

Sequence Alignment with Monotonic Gap Penalties

This page describes an algorithm to find Smith-Waterman style local alignments using monotonic gap penalties.

See Mott, R (1999) Local sequence alignments with monotonic gap penalties Bioinformatics 6:455-462.

Statistical significance is estimated using the methods used in the ariadne package.

Abstract

MOTIVATION: Sequence alignments obtained using affine gap penalties are not always biologically correct, because the insertion of long gaps is over-penalised. There is a need for an efficient algorithm which can find local alignments using non-linear gap penalties.

RESULTS: A dynamic programming algorithm is described which computes optimal local sequence alignments for arbitrary, monotonically increasing gap penalties, i.e. where the cost g(k) of inserting a gap of k symbols is such that g(k) >= g(k-1). The running time of the algorithm is dependent on the scoring scheme; if the expected score of an alignment between random, unrelated sequences of lengths m, n is proportional to log mn, then with one exception, the algorithm has expected running time O(mn). Elsewhere, the running time is no greater than O(mn(m+n)). Optimisations are described which appear to reduce the worst-case run- time to O(mn) in many cases. We show how using a non-affine gap penalty can dramatically increase the probability of detecting a similarity containing a long gap.

AVAILABILITY: The source code can be donwnloaded for non-commercial use. Code is (C) Richard Mott 1999-2000.

MONOTONE Installation instructions:

To compile the C source code for the monotone program, type

      gcc -o monotone *.c -lm

This will create the binary ./monotone To get help type


      zeon:/projects/rmott/MONOTONE> ./monotone -help

      usage: ./monotone
              -seq1           Readable File        [  ]
              -seq2           Readable File        [  ]
              -matrix         text                 [ blosum62 ]
              -gap_start      float                [ 9 ]
              -gap_extend     float                [ 3 ]
              -gap_power      float                [ 0.5 ]
              -type           text                 [ log ]
              -align          switch               [ true ]
              -self           switch               [ false ]
              -shuffle        integer              [ 0 ]
              -opt            switch               [ true ]
              -seed           integer              [ 71166 ]
              -len1           integer              [ 300 ]
              -len2           integer              [ 300 ]
              -help           switch               [  ]


-seq1 and -seq2 should be FASTA-format protein sequences

-matrix is a file containing a BLAST-format substitution matrix. If the environment variable BLASTMAT is defined then the program will look there.

The gap penalty function g(k), where k is the gap length, is defined by the -type switch. It must be one of
log logarithmic function of the form A + B*log(k)
affine A+B*k
power A + B*k^C
[filename] user-defined in the a file [filename]

the gap penalty constants A, B, C are defined using the switches -gap_start (A) and -gap_extend (B) and -gap_popwer (C)

-align is a boolean switch that controls whether the alignment is printed

-self is a boolean switch that controls whether a sequence is compared against itself, if 'true' then only -seq1 need be specified.

the parameters -shuffle, -opt, -len1, -len2 are only used for simulation studies, and are not relevant for comparisons between real sequences.

Example

Consider the alignment between the protein sequences P69892 (Human hemoglobin gamma-2 chain) and AAO61494 (hemoglobin alpha 4 subunit [Takifugu rubripes]).


       ./monotone -seq1 P69892.fa -seq2 AAO61494.fa -align -type log -matrix BLOSUM62 -gap_extend 7
       matrix BLOSUM62 9.000 7.000 seq1 P69892 seq2 AAO61494
       gap penalty log is convex
       lambda(0) 0.3151 alpha 0.0799 lambda(alpha) 0.2835
       matrix BLOSUM62 gap log start 9 extend 7 len1 147 len2 141 lambda 0.3151 s 0.4632 alpha 0.0799 theta1 0.8998 theta2 0.9027 K1 0.0598 K2 0.1077 L_GEM 25.92
       clower 6.725366 cupper 7.314104
       score = 196.881897
            5 teedkatitslwgkv--nvedaggetlgrllvvypwtqrffdsfgnlssa     53 P69892
              ::::|  :  :|  :    || | ::| |:   :| :: :| |  ::|  
            3 sqkekrllaevwkslipaaedigsdsllrmftafpgsktyf-shldispr     52 AAO61494
       
           54 saimgnpkvkahgkkvltslgdaikhlddlkgtfaqlselhcdklhvdpe    103 P69892
              |     | : :||:|:: :: :  : :  |    | |  ||  :| :|| 
           53 s-----phmlshgrkivlaiaegaqdisqlavnlaplqilhayqlridpt     97 AAO61494
       
          104 nfkllgnvlvtvlaihfgkeftpevqaswqkmvtgvasalssry    147 P69892
              ||||| : |:  || | | :||||  |:  | ::  |: |: :|
           98 nfkllthcllvalachlgddftpeahaaidkylsafaavlaeky    141 AAO61494
       
       score 196.8819 pval 6.372499e-22
       
       

Drop me an email for more details.

 
spacer