Download PDF paper

Download Software

Solaris Version

[an error occurred while processing this directive]

EST_GENOME: A program to align a spliced DNA sequence to an unspliced genomic sequence

est_genome (Version 5.2)

Purpose:

est_genome is a software tool to aid the prediction of genes by sequence homology. The program will align a set of spliced nucleotide sequences (ESTs cDNAs or mRNAs) to an unspliced genomic DNA sequence, inserting introns of arbitrary length when needed. In addition, where feasible introns start and stop at the splice consensus dinucleotides GT and AG.

Unless instructed otherwise, the program makes three alignments: First it compares both stands of the spliced sequence against the forward strand of the genomic, assuming the splice consensus GT/AG (ie in the forward gene direction). The maximum-scoring orientation is then realigned assuming the splice consensus CT/AC (ie in the reversed gene direction). Only the overall maximum-scoring alignment is reported.

The program outputs a list of the exons and introns it has found. The format is like that of MSPcrunch, ie a list of matching segments. This format is easy to parse into other software. The program also indicates, based on the splice site information, the gene's predicted direction of transcription. Optionally the full sequence alignment is printed as well (see the example).

Algorithm:

est_genome uses a linear-space dynamic-programming algorithm. It has the following parameters:

	parameter		default	description

	match			1	score for matching two bases
	mismatch		1	cost for mismatching two bases
	gap_penalty		2	cost for deleting a single base in 
					either sequence, excluding introns
	intron_penalty		40	cost for an intron, independent of length.
	splice_penalty		20	cost for an intron, independent of length and 
					starting/ending on donor-acceptor sites.

	space			100	Space threshold (in  megabytes) 
                                       	for linear-space recursion. If product of
					sequence lengths divided by 4 exceeds this 
                                        then a divide-and-conquer strategy is used to 
                                        control the memory requirements. 
					In this way very long sequences can be aligned. 
					If you have a machine with plenty of memory you 
                                        can raise this parameter (but do not exceed the 
                                        machine's physical RAM). However, normally you 
                                        should not need to change this parameter.

There is no gap initiation cost for short gaps, just a penalty proportional to the length of the gap. Thus the cost of inserting a gap of length L in the EST is

    L*gap_penalty 

and the cost in the genome is

 
    min { L*gap_penalty, intron_penalty } or
    min { L*gap_penalty, splice_penalty } 

if the gap starts with GT and ends with AG (or CT/AC if splice direction reversed). The difference between intron_penalty and splice_penalty allows for some slack in marking the intron end-points. It is often the case that the best intron boundaries, from the point of view of minimising mismatches, will not coincide exactly with the splice consensus, so provided the difference between the intron/splice penalties outweighs the extra mismatch/indel costs the alignment will respect the proper boundaries. If the alignment still prefers boundaries which don't start and end with the splice consensus then this may indicate errors in the sequences.

The default parameters work well, except for very short exons (length < splice_penalty, approx) which may be skipped. The intron penalties should not be set to less that the maximum expected random match between the sequences (typically 10-15 bp) in order to avoid spurious matches.

The algorithm has the following steps:

  1. A first-pass Smith-Waterman scan is done to locate the score, start and end of the maximal scoring segment (including introns of course). No other alignment information is retained.
  2. Subsequences corresponding to the maximal-scoring segments are extracted. If the product of these subsequences' lengths is less than the area parameter then the segments are re-aligned using the Needleman-Wunsch algorithm, which in this instance will give the same result as the Smith-Waterman since they are guaranteed to align end-to-end.
  3. If the product of lengths exceeds the area threshold then the alignment is recursively broken down by splitting the EST in half and finding the genome position which aligns with the EST mid-point. The problem then reduces to aligning the left-hand and right-hand portions of the sequences separately and merging the result.

The worst-case run-time for the algorithm is about 3 times as long as would be taken to align using a quadratic-space program. In practice the maximal-scoring segment is often much shorter than the full genome length so the program runs only about 1.5 times slower.

Under very exceptional circumstances the program is unable to compute the alignment using the space limit specified by the -space command-line option. The program will terminate with an error message. Usually the problem can be solved by re-running the program with increased space. The default value of -space is now 100 so this problem should not recurr.

Command-line parameters:

    sombre[rmott]43: est_genome -help
    
    est_genome -help
    
    usage: est_genome
    	-genome         Readable File        [  ]
    	-est            Readable File        [  ]
    	-match          integer              [ 1 ]
    	-mismatch       integer              [ 1 ]
    	-gap_panalty    integer              [ 2 ]
    	-intron_penalty integer              [ 40 ]
    	-splice_penalty integer              [ 20 ]
    	-minscore       integer              [ 30 ]
    	-splice         switch               [ true ]
    	-align          switch               [ false ]
    	-width          integer              [ 50 ]
    	-mode           text                 [ both | forward | reverse ]
    	-best           switch               [ true ]
    	-space          float                [ 100 ]
    	-verbose        switch               [ false ]
    	-shuffle        integer              [ 0 ]
    	-seed           integer              [ 1280 ]
    	-help           switch               [  ]
    

  • The sequences should be in FASTA-format. The est file can contain more than one sequence in which case all sequences are compared against the genomic sequence.

  • The -mode switch determines the comparion mode. The default value is "both", in which case both strands of the est are compared assuming a forward gene direction (ie GT/AG splice sites), and the best comparsion redone assuming a reversed (CT/AC) gene splicing direction. The other allowed modes are "forward", when just the forward strand is searched, and "reverse", ditto for the reverse strand.
  • Print out all comparisons instead of just the best one with the -nobest swirch
  • If you want to ignore donor-acceptor sites use the switch -nosplice.
  • Exclude alignments with scores below a threshold by setting -minscore.

Example: est_genome -genome CNFG9.seq -est yo13c02.s1.seq -align

Output

MSP type segments

There are four types of segment,

  1. each gapped Exon
  2. each Intron (marked with a ? if it does not start GT and end AG)
  3. the complete alignment Span
  4. individual ungapped matching Segments.

In the above example:

    
    Type     score    %  gstart gstop genome       estart estop EST          EST doc
    
    
    Note Best alignment is between forward est and forward genome, but splice  sites imply REVERSED GENE
    Exon       168  92.3 25669 25874 CNFG9           20   220 yo13c02.s1    519      0    519  SCF
    -Intron    -20   0.0 25875 26278 CNFG9       
    Exon       208  98.6 26279 26492 CNFG9          221   435 yo13c02.s1    519      0    519  SCF
    -Intron    -20   0.0 26493 27390 CNFG9       
    Exon        62  87.4 27391 27477 CNFG9          436   518 yo13c02.s1    519      0    519  SCF
    
    Span       398  94.1 25669 27477 CNFG9           20   518 yo13c02.s1    519      0    519  SCF
    
    Segment      4  83.3 25669 25674 CNFG9           20    25 yo13c02.s1    519      0    519  SCF
    Segment     36  95.0 25676 25715 CNFG9           26    65 yo13c02.s1    519      0    519  SCF
    Segment      1  66.7 25717 25719 CNFG9           66    68 yo13c02.s1    519      0    519  SCF
    Segment      5  85.7 25721 25727 CNFG9           69    75 yo13c02.s1    519      0    519  SCF
    Segment      7 100.0 25729 25735 CNFG9           76    82 yo13c02.s1    519      0    519  SCF
    Segment     53  96.5 25737 25793 CNFG9           83   139 yo13c02.s1    519      0    519  SCF
    Segment     65 100.0 25795 25859 CNFG9          140   204 yo13c02.s1    519      0    519  SCF
    Segment     11  86.7 25860 25874 CNFG9          206   220 yo13c02.s1    519      0    519  SCF
    Segment     33 100.0 26279 26311 CNFG9          221   253 yo13c02.s1    519      0    519  SCF
    Segment    177  98.9 26312 26492 CNFG9          255   435 yo13c02.s1    519      0    519  SCF
    Segment     24  96.2 27391 27416 CNFG9          436   461 yo13c02.s1    519      0    519  SCF
    Segment     18  90.9 27418 27439 CNFG9          462   483 yo13c02.s1    519      0    519  SCF
    Segment     19  87.5 27441 27464 CNFG9          484   507 yo13c02.s1    519      0    519  SCF
    Segment      4 100.0 27466 27469 CNFG9          508   511 yo13c02.s1    519      0    519  SCF
    Segment      5  85.7 27471 27477 CNFG9          512   518 yo13c02.s1    519      0    519  SCF
    

The score for Exon segments is the alignment score excluding flanking intron penalties. The Span score is the total including the intron costs.

The coordinates of the genomic sequence always refer to the positive strand, but are swapped if the est has been reversed. The splice direction of Introns are indicated as +Intron (forward, splice sites GT/AG) or -Intron (reverse, splice sites CT/AC), or ?Intron (unknown direction). Segment entries give the alignment as a series of ungapped matching segments.

Full alignment

You get the alignment if the -align switch is set. The alignment includes the first and last 5 bases of each intron, together with the intron width. The direction of splicing is indicated by >>>> (forward) or <<<< (reverse) or ???? (unknown):

    
    
    CNFG9 vs yo13c02.s1:
    
         CNFG9  25669 ATCAGCGCTGCGGCCGCCCGGAAGCTCATCTTGGCCACCGACTCTCGCTT  25718
                      || |||  |||| ||||||||||||||||||||||||||||||||||  |
    yo13c02.s1     20 ATAAGC-TTGCGACCGCCCGGAAGCTCATCTTGGCCACCGACTCTCG-AT     67
    
         CNFG9  25719 GCGCCGCCGCGGGAGCCGGTGGAAACCTGAGCGGGAGCTGGAGAAGGAGC  25768
                      | | ||||| ||||||| ||||||||||||||||||  ||||||||||||
    yo13c02.s1     68 G-GTCGCCG-GGGAGCC-GTGGAAACCTGAGCGGGACGTGGAGAAGGAGC    114
    
         CNFG9  25769 AGAGGGAGGCAGCACCCGGCGTGACGGGAGTGTGTGGGGCACTCAGGCCT  25818
                      ||||||||||||||||||||||||| ||||||||||||||||||||||||
    yo13c02.s1    115 AGAGGGAGGCAGCACCCGGCGTGAC-GGAGTGTGTGGGGCACTCAGGCCT    163
    
         CNFG9  25819 TCCGCAGTGTCATCTGCCACACGGAAGGCACGGCCACGGGC-CAGGGGGT  25867
                      |||||||||||||||||||||||||||||||||||||||||   ||||||
    yo13c02.s1    164 TCCGCAGTGTCATCTGCCACACGGAAGGCACGGCCACGGGCAGGGGGGGT    213
    
         CNFG9  25868 CTATGATctgga.....catacCTTCTGCATGCCCAGCTGGCATGGCCCC  26306
                      |||||||<<<<< 404 <<<<<||||||||||||||||||||||||||||
    yo13c02.s1    214 CTATGAT...............CTTCTGCATGCCCAGCTGGCATGGCCCC    248
    
         CNFG9  26307 ACGTA-GAGTGGGGGTGGCGTCTCGGTGCTGGTCAGCGACACGTTGTCCT  26355
                      ||||| |||||||| |||||||||||||||||||||||||||||||||||
    yo13c02.s1    249 ACGTAGGAGTGGGGTTGGCGTCTCGGTGCTGGTCAGCGACACGTTGTCCT    298
    
         CNFG9  26356 GGCTGGGCAGGTCCAGCTCCCGGAGGACCTGGGGCTTCAGCTTCCCGTAG  26405
                      ||||||||||||||||||||||||||||||||||||||||||||||||||
    yo13c02.s1    299 GGCTGGGCAGGTCCAGCTCCCGGAGGACCTGGGGCTTCAGCTTCCCGTAG    348
    
         CNFG9  26406 CGCTGGCTGCAGTGACGGATGCTCTTGCGCTGCCATTTCTGGGTGCTGTC  26455
                      ||||||||||||||||||||||||||||||||||||||||||||||||||
    yo13c02.s1    349 CGCTGGCTGCAGTGACGGATGCTCTTGCGCTGCCATTTCTGGGTGCTGTC    398
    
         CNFG9  26456 ACTGTCCTTGCTCACTCCAAACCAGTCGGCGGTCCCCctggc.....ggt  26492
                      |||||||||||||| ||||||||||||||||||||||<<<<< 898 <<<
    yo13c02.s1    399 ACTGTCCTTGCTCATTCCAAACCAGTCGGCGGTCCCC.............    435
    
         CNFG9  26492 acCTGCGGATGGTCTGTGTGATGGACGTCTGGCGTTGCAGCACCGGCCGC  27438
                      <<||||||||||| |||||||||||||| |||  ||||||||||||||||
    yo13c02.s1    435 ..CTGCGGATGGTTTGTGTGATGGACGT-TGGGCTTGCAGCACCGGCCGC    482
    
         CNFG9  27439 CGGAGCTCATGGTGGGGTGAAGAGATGTGGGCTGTCTCG  27477
                      | |||  |||||| |||||||||||| |||| | |||||
    yo13c02.s1    483 C-GAGTCCATGGTNGGGTGAAGAGAT-TGGG-TTTCTCG    518
    
    
    Alignment Score: 398

In this example there are two reverse introns, of lengths 404 and 898 bp.


Release Notes

  • Release 5.2 Increased area for printing genome coordinates so that chromosome-sized coordinates don't cause misalignments.
  • Release 4: Fixed memory leak which crashed on Irix but showed no obvious symptoms on other systems, plus another minor leak with no symptoms.
  • Release 3: Fixed error in alignment positions in some special cases. Thanks to Peter Rice.

Email Richard Mott for more details.

 
 
spacer