BCB @ ISU GeneSeqer Download Help Tutorial References Contact
 


Instructions for GeneSeqer


Program description

GeneSeqer is a gene identification tool based on spliced alignment or "spliced threading" of ESTs with a genomic query sequence. In a spliced alignment, aligned residues in the genomic sequence are assigned exon status. Introns are identified as large gaps in the alignment, typically (but not necessarily) flanked by the consensus GT and AG dinucleotides at the donor and acceptor sites, respectively. The optimal alignment is derived by scoring for both sequence similarity and potential splice site strength (Reference 1). The program is designed to handle alignment of a large number of ESTs against a long genomic query sequence (BAC size). Therefore, the ESTs are pre-screened, and only ESTs with sufficiently significant matches are fully aligned (Reference 2). The fast screen requires pre-processing of the EST database. Several EST collections are maintained and updated by periodically downloading the latest public repositories (dbEST). Users may also supply their own EST collection (pre-processing may take some time). The SplicePredictor program provides spliced alignment without the initial screening, including spliced alignment with protein targets (Reference 3).
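
As a small illustration of the GT/AG convention described above, the following Python sketch checks whether a candidate intron is flanked by the consensus dinucleotides. The function name and the 1-based inclusive coordinates are assumptions made for this example; this is not GeneSeqer's scoring code, which weighs splice site strength rather than applying a yes/no test.

```python
def has_consensus_splice_sites(genomic, intron_start, intron_end):
    """Return True if the intron begins with GT and ends with AG.

    Coordinates are 1-based and inclusive (an assumption for this sketch),
    counted on the strand given in `genomic`.
    """
    intron = genomic[intron_start - 1:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Hypothetical example: the intron occupies positions 4 to 11.
example = "AAAGTCCCCAGTTT"
is_canonical = has_consensus_splice_sites(example, 4, 11)
```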

 

Parameter field specifications

Species - select one of 'human', 'mouse', 'rat', 'chicken', 'Drosophila', 'nematode', 'yeast', 'Aspergillus', 'Arabidopsis' [default], or 'maize' to use species-specific parameters for splice site prediction with Bayesian statistical models (Reference 4). The "generic" choice assigns standard values to all GT and AG potential donor and acceptor sites, respectively.

 

Genomic DNA input

Genomic DNA may be supplied by pasting into the text window or by file upload (type the name of your sequence file or select the file using the Browse option; note that this refers to files residing on your local disk). Alternatively, you may simply supply a GenBank accession number and our server will automatically retrieve the corresponding file from GenBank (no format selection necessary); this sequence retrieval function is based on Bioperl.

Plain sequence format refers to raw sequence data pasted or typed into the sequence area. Sequences should be in the one-letter code ({a,b,c,g,h,k,m,n,r,s,t,u,w,y}), upper or lower case; all other characters are ignored during input. Multiple sequences are accepted in FASTA format or in GenBank format.

FASTA format refers to raw sequence data separated by identifier lines of the form ">SQ;name_of_sequence comments". Example:

>SQ;sequence1 - upper case
ACGATTGGATCAAAATCCATGAAAGAGGGGAATCTATAGGCGGAATTGAGGGGGGGATCTCGCCAGCGAC
TGGCTGCCTTGGCGGGGGAGGCCTTGGCGGA

>SQ;sequence2 - upper case with numbering
       1  ACGATTGGAT CAAAATCCAT GAAAGAGGGG AATCTATAGG CGGAATTGAG GGGGGGATCT
      61  CGCCAGCGAC TGGCTGCCTT GGCGGGGGAG GCCTTGGCGG A

>SQ;sequence3 - lower case
acgattggatcaaaatccatgaaagaggggaatctataggcggaattgagggggggatctcgccagcgac
tggctgccttggcgggggaggccttggcgga

>SQ;sequence4 - mixed format
       1  ACGATTGGAT CAAAATCCAT GAAAGAGGGG AATCTATAGG CGGAATTGAG GGGGGGATCT
cgccagcgac
        tggctgcct       tggcggggg       AGGCCTTGGCGGA
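
The input rules above (identifier lines of the form ">SQ;name comments", position numbers and spacing ignored) can be sketched in Python as follows. This is only an illustration of the described behavior, not the server's parser; the function name is hypothetical, and the accepted letter set is copied from the plain sequence format description.

```python
# One-letter codes listed in the plain sequence format description.
ALLOWED = set("abcghkmnrstuwy")


def parse_sq_fasta(text):
    """Split '>SQ;name comments' records into (name, comments, sequence)."""
    records = []
    name, comments, chunks = None, "", []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records.append((name, comments, "".join(chunks)))
            header = line[1:].strip()
            if header.startswith("SQ;"):
                header = header[3:]
            parts = header.split(None, 1)
            name = parts[0] if parts else ""
            comments = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif name is not None:
            # Digits, blanks, and any other characters are dropped.
            chunks.append("".join(c for c in line if c.lower() in ALLOWED))
    if name is not None:
        records.append((name, comments, "".join(chunks)))
    return records
```

With this sketch, the numbered and mixed-format examples above reduce to the same letters as the first example (case aside).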

GenBank format refers to raw sequence data with possible annotations as in standard GenBank files. Minimal requirements are the LOCUS and ORIGIN lines. Multiple sequences must be separated by // lines.

Sequence name is an optional label applied to sequences supplied in plain sequence format.

Sequence selection - The fields "From position" and "To position" below the sequence pasting area provide for selecting a restricted segment of the input sequence for analysis. Positions refer to numbering of letters in the sequence starting with 1 and increasing 5' to 3'.

Strand selection includes the options "original" (sequence 5' to 3' as pasted), "reverse" (sequence complementary to the input; sites are indicated by position numbers referring to the original strand pasted as input), or "both" [default].
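
The position and strand conventions can be illustrated with a short Python sketch. The function names are hypothetical and the code is not taken from GeneSeqer; it only shows 1-based segment selection, reverse complementation, and how a position on the reverse strand maps back to the numbering of the sequence as pasted.

```python
# Complement table for nucleotide codes; s, w, and n are their own complements.
COMPLEMENT = str.maketrans("acgturykmbvdhACGTURYKMBVDH",
                           "tgcaayrmkvbhdTGCAAYRMKVBHD")


def select_segment(seq, from_pos=1, to_pos=None):
    """Return the segment from from_pos to to_pos (1-based, inclusive)."""
    if to_pos is None:
        to_pos = len(seq)
    if not 1 <= from_pos <= to_pos <= len(seq):
        raise ValueError("positions must satisfy 1 <= from <= to <= length")
    return seq[from_pos - 1:to_pos]


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def original_position(index_on_reverse, to_pos):
    """Map a 1-based index on the reverse-complemented segment back to
    the numbering of the original strand."""
    return to_pos - index_on_reverse + 1


segment = select_segment("ACGTAACG", 5, 8)   # the last four letters
reverse = reverse_complement(segment)
```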

 

EST database selection

You may select a pre-processed EST database or supply your own EST collection. The pre-processed EST databases currently available are listed below; each is the result of the Batch Entrez query "'Species'[ORGN] AND EST [KYWD]".

Last update: July 24, 2002

Label                   Species                         # of ESTs
All Plants              All the following plants
All Monocots            All the monocots from the following plants
All Dicots              All the dicots from the following plants
Arabidopsis             Arabidopsis thaliana              174,624
alfalfa                 Medicago sativa                       719
barley                  Hordeum vulgare                   247,211
beet                    Beta vulgaris                       6,034
tree cotton             Gossypium arboreum                 38,894
upland cotton           Gossypium hirsutum                  9,461
ice plant               Mesembryanthemum crystallinum      17,190
liverwort               Marchantia polymorpha               1,415
L.japonicus             Lotus japonicus                    31,670
L.hirsutum              Lycopersicon hirsutum               2,504
L.pennelli              Lycopersicon pennelli               8,346
maize                   Zea mays                          167,669
M.truncatula            Medicago truncatula               163,284
oat                     Avena sativa                          501
pine                    Pinus taeda                        60,226
potato                  Solanum tuberosum                  94,258
rice                    Oryza sativa                      104,973
rye                     Secale cereale                      8,930
sorghum                 Sorghum bicolor                    84,712
S.propinquum            Sorghum propinquum                 21,387
soybean                 Glycine max                       266,638
tomato                  Lycopersicon esculentum           148,358
wheat                   Triticum aestivum                 191,182

Drosophila              Drosophila melanogaster           255,455 (April 12)
C.elegans               Caenorhabditis elegans            191,268 (April 12)

 

Output description

For each significantly matching EST, the predicted gene structure based on an optimal spliced alignment is displayed. The upper line gives the genomic DNA and the lower line gives the EST sequence. Identities are indicated by vertical bars in the center line. Introns are indicated by dots, gaps in the exons by '_'. Coordinates for the predicted exons and introns are given in the list preceding the alignment. Exons are assigned a normalized similarity score (1.000 represents 100% identity). For introns, the list gives the P-values (Reference 5) of the donor and acceptor sites as well as a similarity score (s) based on the sequence similarity in the adjacent 50 bases of exon.

Special lines

MATCH gDNAx cDNAy scr lgth cvrg y
where gDNA = name of genomic DNA sequence; x = + (forward strand) or - (reverse strand); cDNA = name of cDNA sequence; y = + (forward strand) or - (reverse strand); scr = alignment score; lgth = cumulative length of scored exons; cvrg = coverage of the genomic DNA segment, the cDNA, or the target protein, whichever is highest; the final field y (distinct from the strand symbol attached to the cDNA name) is G, C, or P to indicate which of these the coverage refers to.

PGS_gDNAx_cDNAy (a b,c d, ...)
or
PGS_gDNAx_qp (a b,c d, ...)

where gDNA = name of genomic DNA sequence; x = + (forward strand) or - (reverse strand); cDNA = name of cDNA sequence; y = + (forward strand) or - (reverse strand); qp = name of target protein; a, b, c, d, ... = exon coordinates.

The MATCH and PGS lines are useful for summarizing the search results for an application involving multiple genomic DNA sequences and multiple ESTs or target proteins (use a combination of 'egrep' and 'sort'). PGS = Predicted Gene Structure (GenBank CDS-styled exon coordinates).
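
As an alternative to 'egrep' and 'sort', the summary lines can be collected with a short script. The Python sketch below assumes that the fields appear whitespace-separated exactly as in the templates above, that the strand symbol is the last character of each name, and that scr and cvrg are decimal numbers and lgth an integer; these are assumptions, and the sample lines are hypothetical rather than real program output.

```python
import re

PGS_RE = re.compile(r"^(PGS_\S+)\s*\((.*)\)\s*$")


def parse_match_line(line):
    """Parse 'MATCH gDNAx cDNAy scr lgth cvrg y'; return None otherwise."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "MATCH":
        return None
    _, gdna, cdna, scr, lgth, cvrg, kind = fields
    return {
        "gdna": gdna[:-1], "gdna_strand": gdna[-1],
        "cdna": cdna[:-1], "cdna_strand": cdna[-1],
        "score": float(scr), "length": int(lgth),
        "coverage": float(cvrg), "coverage_of": kind,  # G, C, or P
    }


def parse_pgs_line(line):
    """Parse 'PGS_... (a b,c d, ...)' into (label, [(a, b), (c, d), ...])."""
    m = PGS_RE.match(line.strip())
    if m is None:
        return None
    try:
        exons = [tuple(int(v) for v in pair.split())
                 for pair in m.group(2).split(",")]
    except ValueError:
        return None
    if any(len(e) != 2 for e in exons):
        return None
    return m.group(1), exons


# Hypothetical output lines, made up for this example.
lines = [
    "MATCH gen1+ est7+ 0.910 640 0.800 C",
    "MATCH gen1+ est42- 0.950 812 0.870 C",
    "PGS_gen1+_est42- (101  250,400  612)",
]
matches = [m for m in map(parse_match_line, lines) if m]
matches.sort(key=lambda m: m["score"], reverse=True)  # best score first
```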

Consensus gene prediction

For EST matching, the overall gene prediction is summarized at the end of the output file in a section labeled "Predicted consensus gene structures". In brief, individual EST alignments are culled to remove weak terminal exon predictions and then assembled into groups of overlapping alignments with respect to the genomic DNA coordinates. Each overlapping cluster of alignments is indicated as PGL (Predicted Gene Location). Within each PGL, alternative exon/intron assignments are indicated by labels AGS (Alternative Gene Structure), followed by the individual PGS lines. Details of the consensus building procedure are discussed in Reference 6.
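
The grouping step can be pictured with a simple interval-clustering sketch in Python. It only illustrates the idea of collecting alignments that overlap in genomic coordinates into one location; it does not cull weak terminal exons or derive alternative gene structures, and it should not be read as the procedure of Reference 6. The function name and the (name, start, end) input layout are assumptions for this example.

```python
def cluster_overlapping(alignments):
    """Group (name, start, end) alignments that overlap on the genomic axis.

    Coordinates are assumed to be inclusive with start <= end.
    Returns a list of clusters, each a dict with start, end, and members.
    """
    clusters = []
    for name, start, end in sorted(alignments, key=lambda a: (a[1], a[2])):
        if clusters and start <= clusters[-1]["end"]:
            # Overlaps the current cluster: extend it.
            clusters[-1]["members"].append(name)
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
        else:
            # No overlap: open a new cluster.
            clusters.append({"start": start, "end": end, "members": [name]})
    return clusters


# Hypothetical alignment spans: est1 and est2 overlap, est3 stands alone.
spans = [("est2", 450, 900), ("est3", 2000, 2500), ("est1", 100, 500)]
locations = cluster_overlapping(spans)
```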

 

References

    1. Usuka, J., Zhu, W. and Brendel, V. (2000)
    Optimal spliced alignment of homologous cDNA to a genomic DNA template.
    Bioinformatics 16, 203-211.

    2. Goodman, F., Juras, G., Zhu, W. and Brendel, V. (2001)
    Gene discovery by fast spliced threading of ESTs into (large) genomic DNA sequence templates.
    unpublished.

    3. Usuka, J., and Brendel, V. (2000)
    Gene structure prediction by spliced alignment of genomic DNA with protein sequences: Increased accuracy by differential splice site scoring.
    J. Mol. Biol. 297, 1075-1085.

    4. Xing, L., and Brendel, V. (2001)
    Species-specific splice site recognition by sequence inspection using Bayesian statistical models.
    unpublished.

    5. Brendel, V., and Kleffe, J. (1998)
    Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA.
    Nucl. Acids Res. 26, 4748-4757.

    6. Zhu, W. and Brendel, V. (2001)
    Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genetic locus.
    unpublished.

 


 