BCB @ ISU SplicePredictor Download Help Tutorial References Contact
 

Instructions for SplicePredictorLL


Program description

This older version of SplicePredictor implements logitlinear models for splice site prediction trained on reliable sets of maize and Arabidopsis thaliana genomic sequences, as described in reference 1. The predictions are based on two variables: (i) the degree of matching to the splice site consensus and (ii) local compositional contrast. The models assign a P-value between 0 and 1 to each potential splice site such that true sites mostly score high and non-sites mostly score low. The P-values represent intrinsic splice site quality. In otherwise constant context, sites with increased P-value are predicted to result in more efficient splicing (see reference 2). Improvements to the basic model include the context-dependent scores rho and gamma (reference 3). The rho-value of a given site is calculated as a weighted product of its P-value times the P-value of its best potential intron-forming complementary splice site; 0 < rho < 1. The gamma-value of a site reflects how well this site fits in the locally predicted splicing pattern. If the given site is in a context that suggests preferred usage of nearby sites as splicing partners to the exclusion of the given site, its gamma-value will be zero. Otherwise it will be a positive value less than or equal to 2; high values of gamma strongly suggest actual usage of the site.
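
As a rough illustration of the general shape of a logit-linear model in two variables, the sketch below maps a consensus-match score and a compositional-contrast score to a value between 0 and 1. The function name, argument names, and all coefficient values are placeholders chosen for illustration; the fitted parameters are those of reference 1 and are not reproduced here.

```python
import math


def logit_linear_p(consensus_score, contrast_score,
                   intercept=-4.0, w_consensus=1.0, w_contrast=1.0):
    """Return a value in (0, 1) from a logit-linear combination of two variables.

    The three coefficients are HYPOTHETICAL placeholders, not the
    parameters fitted in reference 1.
    """
    z = intercept + w_consensus * consensus_score + w_contrast * contrast_score
    # The logistic function turns the linear predictor into a value in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))
```

With any positive weights, a site that matches the consensus better or shows stronger compositional contrast receives a higher value, which is the qualitative behavior described above.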

To quickly assess the overall quality of a site we implemented a * grading system: the values of P, rho, and gamma are labeled 5*, 4*, 3*, or 2* if they match or exceed the threshold values for 90%, 80%, 65%, and 50% prediction specificity on the training set, respectively, and 1* otherwise. The sum of the three *-values (ranging from 3* to 15*) serves as a simple combined measure. For example, sites scoring 14* or 15* are highly reliable (estimated specificity > 90%).
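
The grading rule above can be sketched as follows. The cut-off numbers in EXAMPLE_CUTOFFS are invented for illustration only; the actual thresholds come from the training sets and are not listed on this page, so this sketch is not expected to reproduce the program's grades.

```python
def star_grade(value, thresholds):
    """Grade one score from 1* to 5*.

    thresholds holds the cut-offs for 90%, 80%, 65% and 50% specificity,
    in that order (so the numbers must be decreasing).
    """
    for stars, cutoff in zip((5, 4, 3, 2), thresholds):
        if value >= cutoff:
            return stars
    return 1


def star_value(p, rho, gamma, cutoffs):
    """Return (combined *-value, (P*, R*, G*)) for one site."""
    grades = (
        star_grade(p, cutoffs["P"]),
        star_grade(rho, cutoffs["rho"]),
        star_grade(gamma, cutoffs["gamma"]),
    )
    return sum(grades), grades


# HYPOTHETICAL cut-offs, for illustration only.
EXAMPLE_CUTOFFS = {
    "P": (0.80, 0.50, 0.20, 0.10),
    "rho": (0.30, 0.10, 0.05, 0.01),
    "gamma": (0.50, 0.30, 0.10, 0.01),
}

print(star_value(0.9, 0.4, 0.9, EXAMPLE_CUTOFFS))
```

Because each of the three grades lies between 1 and 5, the combined value always falls between 3 and 15.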

Minimal input to the program consists of a genomic sequence for which potential splice sites are to be listed. Optionally, the user may also supply cDNA/ESTs or "target proteins" which are known or suspected to significantly match the genomic sequence or its translation into encoded amino acid chains. If supplied, the algorithm will return optimal spliced alignments which "thread" the targets into the genomic DNA by scoring for splice sites and sequence similarity in potential exons while allowing for introns as long gaps in the alignment (references 4 and 5).

 

Genomic DNA input

Genomic DNA may be supplied by pasting into the text window or by file upload (type the name of your sequence file or select the file using the Browse option; note that this refers to files residing on your local disk). Alternatively, you may simply supply a GenBank accession number and our server will automatically retrieve the corresponding file from GenBank (no format selection necessary); this sequence retrieval function is based on Bioperl.

Plain sequence format refers to raw sequence data pasted or typed into the sequence area. Sequences should be in the one-letter code ({a,b,c,g,h,k,m,n,r,s,t,u,w,y}), upper or lower case; all other characters are ignored during input. Multiple sequences are accepted in FASTA format or in GenBank format.

FASTA format refers to raw sequence data separated by identifier lines of the form ">SQ;name_of_sequence comments". Example:

>SQ;sequence1 - upper case
ACGATTGGATCAAAATCCATGAAAGAGGGGAATCTATAGGCGGAATTGAGGGGGGGATCTCGCCAGCGAC
TGGCTGCCTTGGCGGGGGAGGCCTTGGCGGA

>SQ;sequence2 - upper case with numbering
       1  ACGATTGGAT CAAAATCCAT GAAAGAGGGG AATCTATAGG CGGAATTGAG GGGGGGATCT
      61  CGCCAGCGAC TGGCTGCCTT GGCGGGGGAG GCCTTGGCGG A

>SQ;sequence3 - lower case
acgattggatcaaaatccatgaaagaggggaatctataggcggaattgagggggggatctcgccagcgac
tggctgccttggcgggggaggccttggcgga

>SQ;sequence4 - mixed format
       1  ACGATTGGAT CAAAATCCAT GAAAGAGGGG AATCTATAGG CGGAATTGAG GGGGGGATCT
cgccagcgac
        tggctgcct       tggcggggg       AGGCCTTGGCGGA
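
The input rules above (identifier lines of the form ">SQ;name comments", with every character outside the accepted letter set ignored) can be sketched as below. This is an illustration of the described behavior, not the server's own parser; the function names are made up, and the letter set is copied verbatim from the text above.

```python
# Letters accepted according to the text above (either case); all else is dropped.
ALLOWED = set("abcghkmnrstuwy")


def clean_sequence(text):
    """Keep only accepted sequence letters, discarding digits, blanks, etc."""
    return "".join(ch for ch in text if ch.lower() in ALLOWED)


def parse_sq_fasta(text):
    """Return a list of (name, sequence) pairs from '>SQ;name comments' records."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records.append((name, clean_sequence("".join(chunks))))
            header = line[1:]
            if header.startswith("SQ;"):
                header = header[3:]
            fields = header.split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, clean_sequence("".join(chunks))))
    return records


demo = ">SQ;s1 - numbered\n    1  ACGT acgt\n   61  NN\n>SQ;s2\nttga 12\n"
print(parse_sq_fasta(demo))
```

Under these rules the numbered, spaced, and mixed-case layouts in the examples above all reduce to the same underlying sequence.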

GenBank format refers to raw sequence data with possible annotations as in standard GenBank files. Minimal requirements are the LOCUS and ORIGIN lines. Multiple sequences must be separated by // lines.

Sequence name is an optional label applied to sequences supplied in plain sequence format.

Sequence selection - The fields "From position" and "To position" below the sequence pasting area provide for selecting a restricted segment of the input sequence for analysis. Positions refer to numbering of letters in the sequence starting with 1 and increasing 5' to 3'.

Strand selection includes the options "original" (sequence 5' to 3' as pasted; default), "reverse" (sequence complementary to the input; sites are indicated by position numbers referring to the original strand pasted as input), or "both".
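
A minimal sketch of the segment and strand conventions described above: positions are 1-based and inclusive, and a position found on the reverse strand is reported in the numbering of the original strand. The function names are illustrative, and the coordinate mapping shown assumes the whole input sequence was analysed.

```python
_CODES = "acgturykmbdhvswn"
_COMPS = "tgcaayrmkvhdbswn"
# Complement table for nucleotide one-letter codes, both cases.
COMPLEMENT = str.maketrans(_CODES + _CODES.upper(), _COMPS + _COMPS.upper())


def select_segment(seq, start=1, end=None):
    """Return seq[start..end], with 1-based inclusive positions."""
    if end is None:
        end = len(seq)
    return seq[start - 1:end]


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def to_original_position(pos_on_reverse, seq_length):
    """Map a 1-based position on the reverse strand to original-strand numbering."""
    return seq_length - pos_on_reverse + 1


print(select_segment("ACGTACGT", 3, 6))
print(reverse_complement("AAGT"))
```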

 

Parameter field specifications

Species - select either maize or Arabidopsis to use species-specific parameters.

Model - The default model [1] incorporates model parameters derived from sub-classification of splice sites (see reference 1). Model [0] does not involve sub-classification and is provided mostly for reference.

Sensitivity - There is always a trade-off between sensitivity ("How many true sites will be correctly predicted?") and specificity ("How large is the number of presumably false positive predictions?"). Four settings are available: "all GU and AG sites" prints out the donor and acceptor model scores at each GU or AG, respectively, in the sequence; "100% learning set" (default) sets the printing threshold at a level that includes all sites that were in our learning sets; "95% learning set" sets the printing threshold at a level that includes 95% of the sites that were in our learning sets; "maximal tau" represents the best compromise between sensitivity and specificity.

Order of sites for display: options are "by position" (order of occurrence 5' to 3' in the sequence; default) or by P- or *-value (donor and acceptor site scores are ordered together, not separately).

P-value threshold - if set, this option overrides the sensitivity criterion selected above.

*-value threshold - if set, this option further restricts the sites selected by the sensitivity or P-value threshold options.

Display may be limited to the top n scoring donor and acceptor sites per sequence. This option is useful for long input sequences to establish the presence and location of any strong potential introns. This option requires that ordering of sites by either P-value or *-value be selected.
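
How the display options above interact can be sketched as follows. This is one reading of the description, not the program's code: the sensitivity setting is represented by a hypothetical P-value cut-off (sensitivity_cutoff), and the top-n limit is assumed to apply to the combined, ordered list of donor and acceptor sites. All names are illustrative.

```python
from typing import NamedTuple


class Site(NamedTuple):
    kind: str   # "D" (donor) or "A" (acceptor)
    loc: int    # site location
    p: float    # P-value
    stars: int  # combined *-value (3 to 15)


def select_sites(sites, sensitivity_cutoff, p_threshold=None,
                 star_threshold=None, order="position", top_n=None):
    """Filter and order sites following the option descriptions above."""
    # An explicit P-value threshold overrides the sensitivity setting.
    cutoff = sensitivity_cutoff if p_threshold is None else p_threshold
    kept = [s for s in sites if s.p >= cutoff]
    # The *-value threshold further restricts the selection.
    if star_threshold is not None:
        kept = [s for s in kept if s.stars >= star_threshold]
    if order == "position":
        kept.sort(key=lambda s: s.loc)
    elif order == "P":
        kept.sort(key=lambda s: s.p, reverse=True)
    elif order == "star":
        kept.sort(key=lambda s: s.stars, reverse=True)
    else:
        raise ValueError("order must be 'position', 'P' or 'star'")
    if top_n is not None:
        if order == "position":
            raise ValueError("top_n requires ordering by P-value or *-value")
        kept = kept[:top_n]
    return kept


sites = [
    Site("D", 35713, 0.206, 10),
    Site("D", 35734, 0.015, 3),
    Site("A", 35819, 0.618, 13),
    Site("D", 36012, 0.922, 15),
]
print([s.loc for s in select_sites(sites, 0.01, order="P", top_n=2)])
```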

 

Output description

Potential splice sites

Example:

t    q      loc     sequence           P      rho   gamma   *  P*R*G*        parse
  .......
D --->    35713           ccgGTttgt   0.206  0.100  0.191  10 (3 4 3)  IAEEEEE-D-IIIAEED
D ->      35734           tctGTaatt   0.015  0.001  0.000   3 (1 1 1)  AEEEEED-I-IIAEEDI
D -->     35774           atgGTaact   0.223  0.001  0.000   6 (3 2 1)  IIAEEDI-I-IAEEDIA
D ->      35799           ttgGTgtgt   0.008  0.000  0.000   3 (1 1 1)  IAEEDII-I-AEEDIAE
A  <----  35819 ttattaattgcgtAGgt     0.618  0.112  0.538  13 (4 4 5)  AEEDIII-A-EEDIAED
D ->      35820           tagGTtcat   0.005  0.000  0.000   3 (1 1 1)  EEDIIIA-E-EDIAEDA
A     <-  35838 atttcctatacaaAGgg     0.062  0.001  0.000   3 (1 1 1)  EDIIIAE-E-DIAEDIA
D ->      35890           tatGTgatt   0.006  0.000  0.001   3 (1 1 1)  DIIIAEE-D-IAEDIAE
A     <-  35929 tgtgattccttcaAGtt     0.001  0.000  0.000   3 (1 1 1)  DIIAEED-I-AEDIAEE
A     <-  35959 gaatattatcctcAGtt     0.011  0.000  0.008   4 (1 1 2)  IIAEEDI-A-EDIAEEE
A     <-  36011 accccaaatttaaAGgt     0.003  0.000  0.000   3 (1 1 1)  IAEEDIA-E-DIAEEEE
D ----->  36012           aagGTacga   0.922  0.494  0.933  15 (5 5 5)  AEEDIAE-D-IAEEEEE
A     <-  36076 atatattccttgtAGgc     0.084  0.004  0.000   4 (1 2 1)  IADIAED-I-AEEEEED
A <-----  36100 tcgtgttcattgcAGat     0.816  0.345  0.732  15 (5 5 5)  ADIAEDI-A-EEEEEDI
A     <-  36122 tgttacctgagatAGta     0.003  0.000  0.000   3 (1 1 1)  DIAEDIA-E-EEEEDIA
A     <-  36125 tacctgagatagtAGaa     0.007  0.000  0.000   3 (1 1 1)  IAEDIAE-E-EEEDIIA
A     <-  36128 ctgagatagtagaAGct     0.003  0.000  0.000   3 (1 1 1)  AEDIAEE-E-EEDIIAE
A     <-  36148 tgtatcctttctgAGgt     0.001  0.000  0.000   3 (1 1 1)  ADIAEEE-E-EDIIAEE
A     <-  36166 gatgctgcgctaaAGgc     0.001  0.000  0.000   3 (1 1 1)  DIAEEEE-E-DIIAEEE
D ----->  36206           acgGTaatg   0.494  0.398  1.266  14 (4 5 5)  IAEEEEE-D-IIAEEED
D ->      36250           ttgGTattc   0.006  0.000  0.000   3 (1 1 1)  AEEEEED-I-IAEEEDI
A     <-  36271 tgagattatatcaAGag     0.002  0.000  0.000   3 (1 1 1)  IAEEEDI-I-AEEEDII
A <-----  36296 ataatttttctgcAGtc     0.805  0.371  0.778  15 (5 5 5)  AEEEDII-A-EEEDIIA
  .......

Column t: type (D, donor, or A, acceptor)
Column q: quality. The length of the arrow indicates the site quality measured by the *-value:

        ----- = *value 14-15 = highly likely (estimated specificity   >90%)
        ----  = *value 11-13 =    likely     (estimated specificity 60-70%)
        ---   = *value  8-10 =    possible   (estimated specificity 35-45%)
        --    = *value  5- 7 =    uncertain  (estimated specificity 10-20%)
        -     = *value  3- 4 =    doubtful   (estimated specificity   < 5%)
The arrow head points into the predicted intron.
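
The arrow notation can be reproduced from the table above with a few lines of code. The function name is illustrative; the *-value ranges are taken directly from the legend.

```python
def quality_arrow(site_type, star_value):
    """Return the quality arrow for a donor ("D") or acceptor ("A") site."""
    if site_type not in ("D", "A"):
        raise ValueError("site_type must be 'D' or 'A'")
    if not 3 <= star_value <= 15:
        raise ValueError("star_value must be between 3 and 15")
    if star_value >= 14:
        shaft = 5
    elif star_value >= 11:
        shaft = 4
    elif star_value >= 8:
        shaft = 3
    elif star_value >= 5:
        shaft = 2
    else:
        shaft = 1
    # The arrow head points into the intron: downstream of a donor,
    # upstream of an acceptor.
    return "-" * shaft + ">" if site_type == "D" else "<" + "-" * shaft


print(quality_arrow("D", 10))
print(quality_arrow("A", 13))
```

For the example output above, the donor at 35713 (*-value 10) gives "--->" and the acceptor at 35819 (*-value 13) gives "<----".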

Column loc: site location (position of first or last base of potential intron for D or A, respectively)
Column sequence: site sequence
Column P: P-value
Column rho: rho-value
Column gamma: gamma-value
Column *: *-value
Column P*R*G*: individual *-values for P, rho, and gamma
Column parse: highest scoring assignment of the given site and the seven adjacent sites upstream and downstream as either A (acceptor), D (donor), E (exon), or I (intron)
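
For downstream processing, one line of the site listing can be split into the columns described above. The sketch below is based only on the layout of the example output shown on this page; the regular expression and field names are assumptions and may need adjusting for other output variants.

```python
import re

SITE_LINE = re.compile(
    r"^(?P<type>[DA])\s+(?P<arrow><-+|-+>)\s+(?P<loc>\d+)\s+(?P<seq>[A-Za-z]+)\s+"
    r"(?P<p>\d\.\d+)\s+(?P<rho>\d\.\d+)\s+(?P<gamma>\d\.\d+)\s+"
    r"(?P<stars>\d+)\s+\((?P<ps>\d)\s+(?P<rs>\d)\s+(?P<gs>\d)\)\s+(?P<parse>\S+)\s*$"
)


def parse_site_line(line):
    """Return a dict of typed fields, or None if the line is not a site line."""
    m = SITE_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "type": m.group("type"),
        "arrow": m.group("arrow"),
        "loc": int(m.group("loc")),
        "sequence": m.group("seq"),
        "p": float(m.group("p")),
        "rho": float(m.group("rho")),
        "gamma": float(m.group("gamma")),
        "stars": int(m.group("stars")),
        "grades": (int(m.group("ps")), int(m.group("rs")), int(m.group("gs"))),
        "parse": m.group("parse"),
    }


example = ("D --->    35713           ccgGTttgt   0.206  0.100  0.191"
           "  10 (3 4 3)  IAEEEEE-D-IIIAEED")
record = parse_site_line(example)
print(record["loc"], record["stars"], record["grades"])
```

A simple consistency check on parsed records is that the combined *-value equals the sum of the three individual grades.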

Spliced alignment - EST (reference 4)

For each EST, the predicted gene structure based on an optimal spliced alignment is displayed. Note that such an alignment is always defined and will always be displayed, even if the input EST is unrelated to the genomic DNA, in which case the displayed alignment is meaningless. Usually, it will be obvious from the display whether the alignment is meaningful or not. The upper line gives the genomic DNA and the lower line gives the EST sequence. Identities are indicated by vertical bars in the center line. Introns are indicated by dots, gaps in the exons by '_'. Coordinates for the predicted exons and introns are given in the list preceding the alignment. Exons are assigned a normalized similarity score (1.000 represents 100% identity). For introns, the list gives the P-values of the donor and acceptor sites as well as a similarity score (s) based on the sequence similarity in the adjacent 50 bases of exon.
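
The display convention for the center line can be illustrated with a short sketch. It assumes the two alignment rows have equal length, that intron positions appear as dots in the EST row, and that exon gaps appear as '_' in either row; these layout details are assumptions based on the description above, and the function name is made up.

```python
def match_line(genomic_row, est_row):
    """Build the center line: '|' for identical bases, blank otherwise."""
    if len(genomic_row) != len(est_row):
        raise ValueError("alignment rows must have equal length")
    symbols = []
    for g, e in zip(genomic_row, est_row):
        if g in "._" or e in "._":
            symbols.append(" ")   # intron dot or exon gap: no identity mark
        elif g.upper() == e.upper():
            symbols.append("|")
        else:
            symbols.append(" ")
    return "".join(symbols)


genomic = "ACGTgtaagACG_T"
est = "ACCT.....ACGAT"
print(genomic)
print(match_line(genomic, est))
print(est)
```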

Spliced alignment - target protein (reference 5)

The display is essentially as for spliced alignments of ESTs. Sequence similarity is based on amino acid substitution scores (BLOSUM62 matrix). The center line in the alignment gives a vertical bar for identities, '+' for positively scoring (conservative) replacements, and '.' for zero scoring (weakly conservative) replacements.
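
The center-line symbols for protein alignments follow directly from the sign of the substitution score. The sketch below takes the scoring function as a parameter; the small table used in the demonstration is only an excerpt of BLOSUM62 entries included for illustration, and the function names are made up.

```python
def protein_match_symbol(a, b, score):
    """Return '|', '+', '.', or ' ' for one aligned pair of amino acids."""
    if a == b:
        return "|"            # identity
    s = score(a, b)
    if s > 0:
        return "+"            # positively scoring (conservative) replacement
    if s == 0:
        return "."            # zero scoring (weakly conservative) replacement
    return " "                # negatively scoring replacement


# Excerpt of BLOSUM62 scores, for illustration only (not the full matrix).
EXCERPT = {("I", "L"): 2, ("K", "R"): 2, ("A", "G"): 0, ("A", "W"): -3}


def excerpt_score(a, b):
    """Look up a pair in the excerpt, in either order."""
    return EXCERPT[(a, b)] if (a, b) in EXCERPT else EXCERPT[(b, a)]


pairs = [("A", "A"), ("L", "I"), ("A", "G"), ("W", "A")]
print("".join(protein_match_symbol(a, b, excerpt_score) for a, b in pairs))
```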

 

References

    1. Kleffe, J., Hermann, K., Vahrson, W., Wittig, B. and Brendel, V.
    Logitlinear models for the prediction of splice sites in plant pre-mRNA sequences.
    Nucl. Acids Res. 24(23), 4709-4718 (1996)

    2. Brendel, V., Kleffe, J., Carle-Urioste, J.C. and Walbot, V.
    Prediction of splice sites in plant pre-mRNA from sequence properties.
    J. Mol. Biol. 276(1), 85-104 (1998)

    3. Brendel, V. and Kleffe, J.
    Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA.
    Nucl. Acids Res. 26(20), 4748-4757 (1998)

    4. Usuka, J., Zhu, W. and Brendel, V.
    Optimal spliced alignment of homologous cDNA to a genomic DNA template.
    Bioinformatics 16(3), 203-211 (2000)

    5. Usuka, J. and Brendel, V.
    Optimal spliced alignment of homologous proteins to a genomic DNA template.
    J. Mol. Biol. 297(5), 1075-1085 (2000)

 


 