Sequence Alignment with monotonic gap penalties


Wellcome Trust Centre for Human Genetics

Input File Formats

  • Substitution matrices (required for sequence-to-sequence comparisons) must be in BLAST format. If the environment variable BLASTMAT is defined, ARIADNE will look for matrices in the directory it points to.
  • Sequence libraries must be in FASTA format, e.g.
    >NP_005359
    mglsdgewqlvlnvwgkveadipghgqevlirlfkghpetlekfdkfkhlksedemkase
    dlkkhgatvltalggilkkkghheaeikplaqshatkhkipvkylefiseciiqvlqskh
    pgdfgadaqgamnkalelfrkdmasnykelgfqg
    
  • Profiles. Each profile consists of a 3-line header followed by the profile data. Blank lines are ignored. The header looks like this:
    >name                            name of the profile, e.g. "globin"
    profile-length alphabet-length   e.g. "141 21"
    alphabet-order                   space-separated list of amino acids in the order they occur in the profile

    The profile data is a series of lines, one per profile position, in the format

    [position] [consensus] [profile-scores in order defined by alphabet-order]

    For example, part of a profile looks like this:

    >globin
    141 21
    A R N D C Q E G H I L K M F P S T W Y V X
        1 H   -4  -1   2  -2  -5  -1  -1  -3  10  -5  -5  -2  -4  -3  -4  -2  -3  -4   0  -5 -1
        2 L   -4  -5  -6  -7   0  -5  -6  -6  -5  -1   6  -5   0   0  -6  -5  -4   6  -3  -2 -1
        3 S   -1  -4  -1  -1  -4  -2  -2  -2  -4  -4  -4  -3  -2  -5  -4   5   6  -5  -5  -3 -1
        4 A    4  -3   0   3  -1   0   2  -1  -2  -3  -3  -1  -2  -3   0   0  -2  -6  -3  -3 -1
        5 E    2  -1  -1   3  -3   0   3  -1   0  -5  -3   1  -4  -3  -1  -1  -1  -6  -4  -3 -1
        6 E   -3  -3   0   4  -6   4   5  -4  -3  -3  -5  -2  -4  -6  -4  -2  -1  -6  -5  -3 -1
        7 K    0   3  -2  -4  -2  -1   1  -4  -1   0  -3   5  -2  -1  -4  -2  -3   4  -4  -1 -1
        8 A    3  -1   0   0  -2   2   0  -2   1  -2  -2   1  -3  -2  -4   1   1  -5  -4  -3 -1
        9 L    2  -2   2  -2  -2  -2  -1  -4   0   1   3   0  -1  -2  -5  -2   0  -5  -2   0 -1
       10 V   -2  -6  -6  -6  -4  -5  -5  -6  -6   5   1  -5  -1  -2  -5  -4  -3  -6  -4   6 -1
       11 K   -1   3   1  -2  -3   2  -2  -3   0  -3   0   5   0  -5  -4  -1   1  -5  -4  -2 -1
       12 A    2  -1   0   1  -3   0   0   0   1  -4  -3   1  -2  -4  -4   3   0  -6  -4  -3 -1
       13 L    0  -4  -2  -3   0   0  -2  -4  -1   1   1  -3  -1  -1  -4   3   2  -5  -3   1 -1
       14 W   -3  -6  -6  -5  -2  -5  -4  -4  -5  -2  -4  -6  -2   2  -7  -4  -4  13   0  -2 -1
    ........... etc etc
    
    For example, at position 2 the consensus is L, and the score for matching an A at this position is -4.

    A library of profiles can contain multiple profiles.

    Profiles and substitution matrices should be such that the smallest span of score values is 1. Also, for reasons of efficiency, it is best if the total range of score values is quite small, say < 30.

    Some example profile sets derived from PFAM-A are provided for download.
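
The profile layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not the parser used by ARIADNE: the function name parse_profiles and the returned dictionary layout are hypothetical, profile names are assumed to contain no spaces, and scores are assumed to be integers. The example input reuses the first two rows of the globin profile shown above under a shortened header (length 2 instead of 141) purely for illustration.

```python
def parse_profiles(text):
    """Parse a profile library into a list of dicts (hypothetical helper)."""
    # Blank lines are ignored, as the format description states.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    profiles = []
    i = 0
    while i < len(lines):
        # Header line 1: ">name"
        if not lines[i].startswith(">"):
            raise ValueError(f"expected '>name' header, got: {lines[i]!r}")
        name = lines[i][1:].split()[0]  # assumption: names have no spaces
        # Header line 2: "profile-length alphabet-length"
        length, alphabet_size = (int(x) for x in lines[i + 1].split()[:2])
        # Header line 3: space-separated alphabet order
        alphabet = lines[i + 2].split()
        if len(alphabet) != alphabet_size:
            raise ValueError(f"{name}: alphabet has {len(alphabet)} symbols, "
                             f"header says {alphabet_size}")
        rows = []
        for ln in lines[i + 3 : i + 3 + length]:
            # Data line: [position] [consensus] [scores in alphabet order]
            fields = ln.split()
            position, consensus = int(fields[0]), fields[1]
            scores = [int(s) for s in fields[2:]]  # assumption: integer scores
            if len(scores) != alphabet_size:
                raise ValueError(f"{name}: position {position} has "
                                 f"{len(scores)} scores, expected {alphabet_size}")
            rows.append((position, consensus, dict(zip(alphabet, scores))))
        if len(rows) != length:
            raise ValueError(f"{name}: found {len(rows)} rows, expected {length}")
        profiles.append({"name": name, "alphabet": alphabet, "rows": rows})
        i += 3 + length  # a library may contain several profiles back to back
    return profiles


# Truncated illustration: two rows of the globin example, header length set to 2.
example = """
>globin_fragment
2 21
A R N D C Q E G H I L K M F P S T W Y V X
    1 H   -4  -1   2  -2  -5  -1  -1  -3  10  -5  -5  -2  -4  -3  -4  -2  -3  -4   0  -5 -1
    2 L   -4  -5  -6  -7   0  -5  -6  -6  -5  -1   6  -5   0   0  -6  -5  -4   6  -3  -2 -1
"""

profiles = parse_profiles(example)
position, consensus, scores = profiles[0]["rows"][1]
print(position, consensus, scores["A"])  # prints: 2 L -4
```

Storing each row as a mapping from amino acid to score keeps the lookup independent of the column order, which may differ between profiles because every profile declares its own alphabet-order line.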

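The recommendation on score values can be checked before running a search. The sketch below is one possible reading of it, not a check performed by the programs themselves: it assumes integer scores, interprets "smallest span" as the smallest gap between two distinct score values, and uses a hypothetical helper name, score_summary. The sample data is row 14 of the globin example above.

```python
def score_summary(scores):
    """Return (smallest gap between distinct values, total range).

    Hypothetical helper; assumes integer scores.
    """
    values = sorted(set(scores))
    if len(values) < 2:
        raise ValueError("need at least two distinct score values")
    smallest_step = min(b - a for a, b in zip(values, values[1:]))
    total_range = values[-1] - values[0]
    return smallest_step, total_range


# Scores from position 14 (consensus W) of the globin example.
row_14 = [-3, -6, -6, -5, -2, -5, -4, -4, -5, -2, -4,
          -6, -2, 2, -7, -4, -4, 13, 0, -2, -1]

step, total = score_summary(row_14)
print(step, total)  # prints: 1 20
if step != 1 or total >= 30:
    print("warning: consider rescaling the scores")
```

In practice the same check would be applied to all scores of a profile or substitution matrix at once, not to a single row.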

Please send questions, comments, and bug reports to Richard Mott.

 