EST_GENOME: A program to align a spliced DNA sequence to an unspliced genomic sequenceest_genome (Version 5.2)Purpose:est_genome is a software tool to aid the prediction of genes by sequence homology. The program will align a set of spliced nucleotide sequences (ESTs cDNAs or mRNAs) to an unspliced genomic DNA sequence, inserting introns of arbitrary length when needed. In addition, where feasible introns start and stop at the splice consensus dinucleotides GT and AG. Unless instructed otherwise, the program makes three alignments: First it compares both stands of the spliced sequence against the forward strand of the genomic, assuming the splice consensus GT/AG (ie in the forward gene direction). The maximum-scoring orientation is then realigned assuming the splice consensus CT/AC (ie in the reversed gene direction). Only the overall maximum-scoring alignment is reported. The program outputs a list of the exons and introns it has found. The format is like that of MSPcrunch, ie a list of matching segments. This format is easy to parse into other software. The program also indicates, based on the splice site information, the gene's predicted direction of transcription. Optionally the full sequence alignment is printed as well (see the example). Algorithm:est_genome uses a linear-space dynamic-programming algorithm. It has the following parameters:
parameter default description
match 1 score for matching two bases
mismatch 1 cost for mismatching two bases
gap_penalty 2 cost for deleting a single base in
either sequence, excluding introns
intron_penalty 40 cost for an intron, independent of length.
splice_penalty 20 cost for an intron, independent of length and
starting/ending on donor-acceptor sites.
space 100 Space threshold (in megabytes)
for linear-space recursion. If product of
sequence lengths divided by 4 exceeds this
then a divide-and-conquer strategy is used to
control the memory requirements.
In this way very long sequences can be aligned.
If you have a machine with plenty of memory you
can raise this parameter (but do not exceed the
machine's physical RAM). However, normally you
should not need to change this parameter.
There is no gap initiation cost for short gaps, just a penalty proportional to the length of the gap. Thus the cost of inserting a gap of length L in the EST is L*gap_penalty and the cost in the genome is
min { L*gap_penalty, intron_penalty } or
min { L*gap_penalty, splice_penalty }
if the gap starts with GT and ends with AG (or CT/AC if splice direction reversed). The difference between intron_penalty and splice_penalty allows for some slack in marking the intron end-points. It is often the case that the best intron boundaries, from the point of view of minimising mismatches, will not coincide exactly with the splice consensus, so provided the difference between the intron/splice penalties outweighs the extra mismatch/indel costs the alignment will respect the proper boundaries. If the alignment still prefers boundaries which don't start and end with the splice consensus then this may indicate errors in the sequences. The default parameters work well, except for very short exons (length < splice_penalty, approx) which may be skipped. The intron penalties should not be set to less that the maximum expected random match between the sequences (typically 10-15 bp) in order to avoid spurious matches. The algorithm has the following steps:
The worst-case run-time for the algorithm is about 3 times as long as would be taken to align using a quadratic-space program. In practice the maximal-scoring segment is often much shorter than the full genome length so the program runs only about 1.5 times slower. Under very exceptional circumstances the program is unable to compute the alignment using the space limit specified by the -space command-line option. The program will terminate with an error message. Usually the problem can be solved by re-running the program with increased space. The default value of -space is now 100 so this problem should not recurr. Command-line parameters:
sombre[rmott]43: est_genome -help
est_genome -help
usage: est_genome
-genome Readable File [ ]
-est Readable File [ ]
-match integer [ 1 ]
-mismatch integer [ 1 ]
-gap_panalty integer [ 2 ]
-intron_penalty integer [ 40 ]
-splice_penalty integer [ 20 ]
-minscore integer [ 30 ]
-splice switch [ true ]
-align switch [ false ]
-width integer [ 50 ]
-mode text [ both | forward | reverse ]
-best switch [ true ]
-space float [ 100 ]
-verbose switch [ false ]
-shuffle integer [ 0 ]
-seed integer [ 1280 ]
-help switch [ ]
Example: est_genome -genome CNFG9.seq -est yo13c02.s1.seq -align OutputMSP type segmentsThere are four types of segment,
In the above example:
Type score % gstart gstop genome estart estop EST EST doc
Note Best alignment is between forward est and forward genome, but splice sites imply REVERSED GENE
Exon 168 92.3 25669 25874 CNFG9 20 220 yo13c02.s1 519 0 519 SCF
-Intron -20 0.0 25875 26278 CNFG9
Exon 208 98.6 26279 26492 CNFG9 221 435 yo13c02.s1 519 0 519 SCF
-Intron -20 0.0 26493 27390 CNFG9
Exon 62 87.4 27391 27477 CNFG9 436 518 yo13c02.s1 519 0 519 SCF
Span 398 94.1 25669 27477 CNFG9 20 518 yo13c02.s1 519 0 519 SCF
Segment 4 83.3 25669 25674 CNFG9 20 25 yo13c02.s1 519 0 519 SCF
Segment 36 95.0 25676 25715 CNFG9 26 65 yo13c02.s1 519 0 519 SCF
Segment 1 66.7 25717 25719 CNFG9 66 68 yo13c02.s1 519 0 519 SCF
Segment 5 85.7 25721 25727 CNFG9 69 75 yo13c02.s1 519 0 519 SCF
Segment 7 100.0 25729 25735 CNFG9 76 82 yo13c02.s1 519 0 519 SCF
Segment 53 96.5 25737 25793 CNFG9 83 139 yo13c02.s1 519 0 519 SCF
Segment 65 100.0 25795 25859 CNFG9 140 204 yo13c02.s1 519 0 519 SCF
Segment 11 86.7 25860 25874 CNFG9 206 220 yo13c02.s1 519 0 519 SCF
Segment 33 100.0 26279 26311 CNFG9 221 253 yo13c02.s1 519 0 519 SCF
Segment 177 98.9 26312 26492 CNFG9 255 435 yo13c02.s1 519 0 519 SCF
Segment 24 96.2 27391 27416 CNFG9 436 461 yo13c02.s1 519 0 519 SCF
Segment 18 90.9 27418 27439 CNFG9 462 483 yo13c02.s1 519 0 519 SCF
Segment 19 87.5 27441 27464 CNFG9 484 507 yo13c02.s1 519 0 519 SCF
Segment 4 100.0 27466 27469 CNFG9 508 511 yo13c02.s1 519 0 519 SCF
Segment 5 85.7 27471 27477 CNFG9 512 518 yo13c02.s1 519 0 519 SCF
The score for Exon segments is the alignment score excluding flanking intron penalties. The Span score is the total including the intron costs. The coordinates of the genomic sequence always refer to the positive strand, but are swapped if the est has been reversed. The splice direction of Introns are indicated as +Intron (forward, splice sites GT/AG) or -Intron (reverse, splice sites CT/AC), or ?Intron (unknown direction). Segment entries give the alignment as a series of ungapped matching segments. Full alignmentYou get the alignment if the -align switch is set. The alignment includes the first and last 5 bases of each intron, together with the intron width. The direction of splicing is indicated by >>>> (forward) or <<<< (reverse) or ???? (unknown):
CNFG9 vs yo13c02.s1:
CNFG9 25669 ATCAGCGCTGCGGCCGCCCGGAAGCTCATCTTGGCCACCGACTCTCGCTT 25718
|| ||| |||| |||||||||||||||||||||||||||||||||| |
yo13c02.s1 20 ATAAGC-TTGCGACCGCCCGGAAGCTCATCTTGGCCACCGACTCTCG-AT 67
CNFG9 25719 GCGCCGCCGCGGGAGCCGGTGGAAACCTGAGCGGGAGCTGGAGAAGGAGC 25768
| | ||||| ||||||| |||||||||||||||||| ||||||||||||
yo13c02.s1 68 G-GTCGCCG-GGGAGCC-GTGGAAACCTGAGCGGGACGTGGAGAAGGAGC 114
CNFG9 25769 AGAGGGAGGCAGCACCCGGCGTGACGGGAGTGTGTGGGGCACTCAGGCCT 25818
||||||||||||||||||||||||| ||||||||||||||||||||||||
yo13c02.s1 115 AGAGGGAGGCAGCACCCGGCGTGAC-GGAGTGTGTGGGGCACTCAGGCCT 163
CNFG9 25819 TCCGCAGTGTCATCTGCCACACGGAAGGCACGGCCACGGGC-CAGGGGGT 25867
||||||||||||||||||||||||||||||||||||||||| ||||||
yo13c02.s1 164 TCCGCAGTGTCATCTGCCACACGGAAGGCACGGCCACGGGCAGGGGGGGT 213
CNFG9 25868 CTATGATctgga.....catacCTTCTGCATGCCCAGCTGGCATGGCCCC 26306
|||||||<<<<< 404 <<<<<||||||||||||||||||||||||||||
yo13c02.s1 214 CTATGAT...............CTTCTGCATGCCCAGCTGGCATGGCCCC 248
CNFG9 26307 ACGTA-GAGTGGGGGTGGCGTCTCGGTGCTGGTCAGCGACACGTTGTCCT 26355
||||| |||||||| |||||||||||||||||||||||||||||||||||
yo13c02.s1 249 ACGTAGGAGTGGGGTTGGCGTCTCGGTGCTGGTCAGCGACACGTTGTCCT 298
CNFG9 26356 GGCTGGGCAGGTCCAGCTCCCGGAGGACCTGGGGCTTCAGCTTCCCGTAG 26405
||||||||||||||||||||||||||||||||||||||||||||||||||
yo13c02.s1 299 GGCTGGGCAGGTCCAGCTCCCGGAGGACCTGGGGCTTCAGCTTCCCGTAG 348
CNFG9 26406 CGCTGGCTGCAGTGACGGATGCTCTTGCGCTGCCATTTCTGGGTGCTGTC 26455
||||||||||||||||||||||||||||||||||||||||||||||||||
yo13c02.s1 349 CGCTGGCTGCAGTGACGGATGCTCTTGCGCTGCCATTTCTGGGTGCTGTC 398
CNFG9 26456 ACTGTCCTTGCTCACTCCAAACCAGTCGGCGGTCCCCctggc.....ggt 26492
|||||||||||||| ||||||||||||||||||||||<<<<< 898 <<<
yo13c02.s1 399 ACTGTCCTTGCTCATTCCAAACCAGTCGGCGGTCCCC............. 435
CNFG9 26492 acCTGCGGATGGTCTGTGTGATGGACGTCTGGCGTTGCAGCACCGGCCGC 27438
<<||||||||||| |||||||||||||| ||| ||||||||||||||||
yo13c02.s1 435 ..CTGCGGATGGTTTGTGTGATGGACGT-TGGGCTTGCAGCACCGGCCGC 482
CNFG9 27439 CGGAGCTCATGGTGGGGTGAAGAGATGTGGGCTGTCTCG 27477
| ||| |||||| |||||||||||| |||| | |||||
yo13c02.s1 483 C-GAGTCCATGGTNGGGTGAAGAGAT-TGGG-TTTCTCG 518
Alignment Score: 398
In this example there are two reverse introns, of lengths 404 and 898 bp.
Release Notes
Email Richard Mott for more details. | |||||
|