Instructions for SplicePredictor
Program description
The current version of SplicePredictor implements Bayesian models for
splice site prediction trained as described in
Reference 1.
The predictions are implicitly based on three variables: (i) degree of
matching to the splice site consensus, (ii) local compositional contrast, and
(iii) assessment of 3-base periodicity in coding regions.
The models assign a P-value between 0 and 1 to each potential splice site
such that true sites mostly score high and non-sites mostly score low.
The P-values represent intrinsic splice site quality: in otherwise
constant context, sites with increased P-value are predicted to result in
more efficient splicing (Reference 2).
Improvements to the basic model include the context-dependent scores rho
and gamma (Reference 3).
The rho-value of a given site is calculated as a weighted product of its
P-value and the P-value of its best potential intron-forming complementary
splice site; 0 < rho < 1.
The gamma-value of a site reflects how well this site fits in the locally
predicted splicing pattern.
If the given site is in a context that suggests preferred usage of nearby sites
as splicing partners to the exclusion of the given site, its gamma-value
will be zero.
Otherwise it will be a positive value less than or equal to 2; high values of
gamma would strongly suggest actual usage of the site.
To quickly assess the overall quality of a site we implemented a * grading
system: the values of P, rho, and gamma are each labeled 5*,
4*, 3*, or 2* if they match or exceed the threshold values for 90%, 80%,
65%, or 50% prediction specificity, respectively, on a training set, and 1* otherwise.
The sum of the *-values (attaining values between 3* and 15*) serves as a
simple combined measure.
For example, sites scoring 14* or 15* are highly reliable (estimated
specificity > 90%).
Currently, precise estimates of specificity have been assessed only for maize
and Arabidopsis sites, using earlier P-value predictions based on logitlinear
models (Reference 3).
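The grading scheme can be sketched in a few lines of Python. The threshold values below are hypothetical placeholders (the actual thresholds depend on the species model and are not listed in these instructions), and the function names are illustrative only, not part of SplicePredictor.

```python
from bisect import bisect_right

# HYPOTHETICAL thresholds: minimum score needed for 50%, 65%, 80%, and 90%
# prediction specificity (ascending order). The real values are not given
# in this document.
EXAMPLE_THRESHOLDS = {
    "P": (0.50, 0.80, 0.90, 0.98),
    "rho": (0.10, 0.30, 0.40, 0.45),
    "gamma": (0.05, 0.50, 1.00, 1.50),
}


def star_grade(value, thresholds):
    """Return 1..5 stars: 1 plus the number of thresholds met or exceeded."""
    return 1 + bisect_right(thresholds, value)


def combined_stars(p, rho, gamma, thresholds=EXAMPLE_THRESHOLDS):
    """Return the individual (P*, R*, G*) grades and their sum (3..15)."""
    grades = (
        star_grade(p, thresholds["P"]),
        star_grade(rho, thresholds["rho"]),
        star_grade(gamma, thresholds["gamma"]),
    )
    return grades, sum(grades)
```

With these placeholder thresholds, combined_stars(0.99, 0.5, 1.9) gives the grades (5, 5, 5) and a combined value of 15.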
Minimal input to the program consists of a genomic sequence for which potential
splice sites are to be listed.
Optionally, the user may also supply cDNA/ESTs or "target proteins" which are
known or suspected to significantly match the genomic sequence or its
translation into encoded amino acid chains.
If supplied, the algorithm will return optimal spliced alignments which
"thread" the targets into the genomic DNA by scoring for splice sites and
sequence similarity in potential exons while allowing for introns as long gaps
in the alignment (References 4 and 5).
Genomic DNA input
Genomic DNA may be supplied by pasting into the text window or by file upload
(type the name of your sequence file or select the file using the Browse
option; note that this refers to files residing on your local disk).
Alternatively, you may simply supply a GenBank accession number and our server
will automatically retrieve the corresponding file from
GenBank
(no format selection necessary); this sequence retrieval function is based on
Bioperl.
Plain sequence format
refers to raw sequence data pasted or typed into the sequence area.
Sequences should be in the one-letter code ({a,b,c,g,h,k,m,n,r,s,t,u,w,y}),
upper or lower case; all other characters are ignored during input.
Multiple sequence input is accepted in FASTA format
or in GenBank format.
FASTA format
refers to raw sequence data separated by identifier lines of the form
">SQ;name_of_sequence comments". Example:
>SQ;sequence1 - upper case
ACGATTGGATCAAAATCCATGAAAGAGGGGAATCTATAGGCGGAATTGAGGGGGGGATCTCGCCAGCGAC
TGGCTGCCTTGGCGGGGGAGGCCTTGGCGGA
>SQ;sequence2 - upper case with numbering
1 ACGATTGGAT CAAAATCCAT GAAAGAGGGG AATCTATAGG CGGAATTGAG GGGGGGATCT
61 CGCCAGCGAC TGGCTGCCTT GGCGGGGGAG GCCTTGGCGG A
>SQ;sequence3 - lower case
acgattggatcaaaatccatgaaagaggggaatctataggcggaattgagggggggatctcgccagcgac
tggctgccttggcgggggaggccttggcgga
>SQ;sequence4 - mixed format
1 ACGATTGGAT CAAAATCCAT GAAAGAGGGG AATCTATAGG CGGAATTGAG GGGGGGATCT
cgccagcgac
tggctgcct tggcggggg AGGCCTTGGCGGA
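The input rules above (identifier lines, ignored digits and blanks, either case) can be illustrated with a small parser. This is a minimal sketch, not the server's own code: the function name parse_sq_fasta is hypothetical, and the set of accepted letters is copied from the list given under "Plain sequence format".

```python
# Letters listed under "Plain sequence format"; everything else is ignored.
ALLOWED = set("abcghkmnrstuwy")


def parse_sq_fasta(text):
    """Parse '>SQ;name comments' records into a {name: sequence} dict.

    Characters outside ALLOWED (digits, blanks, punctuation) are dropped,
    and the case of the remaining letters is preserved.
    """
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header.startswith("SQ;"):
                header = header[3:]
            parts = header.split(None, 1)  # name first, comments after
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(
                "".join(ch for ch in line if ch.lower() in ALLOWED)
            )
    return {key: "".join(chunks) for key, chunks in records.items()}
```

Applied to the four example records above, this sketch should yield the same sequence for each record apart from letter case.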
GenBank format
refers to raw sequence data with possible annotations as in standard GenBank
files.
Minimal requirements are the LOCUS and ORIGIN lines.
Multiple sequences must be separated by // lines.
Sequence name
is an optional label applied to sequences supplied in plain
sequence format.
Sequence selection
- The fields "From position" and "To position" below the sequence
pasting area provide for selecting a restricted segment of the input sequence
for analysis.
Positions refer to numbering of letters in the sequence starting with 1 and
increasing 5' to 3'.
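The coordinate conventions for segment selection, and for the "reverse" strand option described under Strand, can be sketched as follows. The helper names are hypothetical, and the complement table handles only A, C, G, T, and U (other letters pass through unchanged), which is a simplifying assumption.

```python
# Complement table for unambiguous bases only (an assumption of this sketch).
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def select_segment(seq, start=1, end=None):
    """Return seq[start..end], using 1-based inclusive positions."""
    end = len(seq) if end is None else end
    if not 1 <= start <= end <= len(seq):
        raise ValueError("positions must satisfy 1 <= start <= end <= length")
    return seq[start - 1:end]


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def original_position(rc_position, length):
    """Map a 1-based position on the reverse strand to the original strand."""
    return length - rc_position + 1
```

For example, select_segment("ACGTACGT", 2, 4) returns "CGT", and position 1 of the reverse strand of an 8-base sequence corresponds to position 8 of the original strand.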
Strand
selection includes the options "original" (sequence 5' to 3' as pasted;
default), "reverse" (sequence complementary to the input; sites are indicated
by position numbers referring to the original strand pasted as input), or
"both".
Parameter field specifications
Species
- select the most appropriate splice site models. This parameter must be
specified. Options: "human", "mouse", "rat", "chicken", "Drosophila", "nematode",
"yeast", "Aspergillus", "Arabidopsis", "maize", "generic".
Cutoff
- There is always a trade-off between sensitivity ("How many true sites
will be correctly predicted?") and specificity ("How large is the
number of presumably false positive predictions?"). For SplicePredictor,
sensitivity and specificity are controlled by the critical value
c = 2 ln BF, where BF is the Bayes factor (ratio of posterior to
prior odds that a given site is a true splice site). Higher values of c
increase specificity but decrease sensitivity
(Reference 1).
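The relation c = 2 ln BF can be made concrete with a short calculation. This is a sketch under stated assumptions: the inputs are a posterior and a prior probability that a site is a true splice site, the prior used by SplicePredictor is not stated in these instructions, and the function names are hypothetical.

```python
import math


def critical_value(posterior_prob, prior_prob):
    """Return c = 2 ln BF, with BF = posterior odds / prior odds."""
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return 2.0 * math.log(posterior_odds / prior_odds)


def passes_cutoff(posterior_prob, prior_prob, cutoff):
    """Return True if the site's c-value reaches the chosen cutoff."""
    return critical_value(posterior_prob, prior_prob) >= cutoff
```

With an assumed prior of 0.5, a posterior of 0.9 gives c = 2 ln 9, roughly 4.39; raising the cutoff above that value would exclude this site, which illustrates how a higher c trades sensitivity for specificity.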
Local Pruning
- This option restricts the number of sites displayed. Within a local sequence
context, otherwise qualifying but clearly suboptimally scoring sites are not
printed if this option is selected ("on" by default).
Score non-canonical sites
- If this option is selected, then all non-canonical dinucleotides are accepted
as potential splice sites and scored. This may be useful to predict borders of
non-canonical introns (i.e., introns with ends other than GT-AG, GC-AG, or
AT-AC).
Order
of sites for display: options are "by position" (order of occurrence 5' to 3' in
the sequence; default) or by P- or *-value (donor and acceptor site scores are
ordered together, not separately).
P-value threshold -
if set, this option overrides the cutoff
criterion selected above.
*-value threshold -
if set, this option further restricts the sites selected by the
cutoff or
P-value threshold options.
Display
may be limited to the top n scoring donor and acceptor sites per
sequence.
This option is useful for long input sequences to establish the presence and
location of any strong potential introns.
This option requires that ordering of sites by either
P-value or *-value be
selected.
Output description
Potential splice sites
Example:
t q      loc   sequence          P     c     rho   gamma  * P*R*G*  parse
.......
D -----> 35713 ccgGTttgt         0.989  9.54 0.489 1.974 15 (5 5 5) ADAEDIA-D-AEDADAD
A <----  35819 ttattaattgcgtAGgt 0.993 10.50 0.337 1.912 13 (5 3 5) DAEDIAD-A-EDADADA
D -->    35859 ctgGTtctg         0.837  3.85 0.000 0.000  5 (3 1 1) AEDIADA-E-DADADAD
D ---->  36012 aagGTacga         0.919  5.44 0.471 0.082 11 (4 5 2) ADIADAE-D-ADADADA
A <----- 36100 tcgtgttcattgcAGat 0.979  8.28 0.900 1.898 14 (4 5 5) DIADAED-A-DADADAD
D -----> 36206 acgGTaatg         0.993 10.38 0.971 1.971 15 (5 5 5) IADAEDA-D-ADADADA
A <----- 36296 ataatttttctgcAGtc 0.978  8.16 0.971 1.971 14 (4 5 5) ADAEDAD-A-DADADAD
D -----> 36432 cagGTatgg         0.999 13.92 0.355 1.977 14 (5 4 5) DAEDADA-D-ADADADA
A <----  36520 acattgcgataacAGgc 0.998 12.81 0.333 1.809 13 (5 3 5) AEDADAD-A-DADADIA
D ---->  36543 ccgGTgaga         0.811  3.49 0.800 1.797 13 (3 5 5) EDADADA-D-ADADIAE
A <----- 36721 ttcgaatctgatcAGgt 0.986  9.15 0.800 1.797 14 (4 5 5) DADADAD-A-DADIAED
D -----> 36722 cagGTgagt         0.993 10.61 0.352 1.980 14 (5 4 5) ADADADA-D-ADIAEED
A <----  36815 ggatgaatgaaacAGga 0.999 13.83 0.334 1.923 13 (5 3 5) DADADAD-A-DIAEEED
.......
Column t: type (D, donor, or A, acceptor)
Column q: quality. The length of the arrow indicates the site quality
measured by the *-value:
----- = *-value 14-15 = highly likely (estimated specificity >90%)
----  = *-value 11-13 = likely (estimated specificity 60-70%)
---   = *-value 8-10 = possible (estimated specificity 35-45%)
--    = *-value 5-7 = uncertain (estimated specificity 10-20%)
-     = *-value 3-4 = doubtful (estimated specificity <5%)
The arrow head points into the predicted intron.
Column loc: site location (position of first or last base of potential
intron for D or A, respectively)
Column sequence: site sequence
Column P: P-value
Column c: c-value
Column rho: rho-value
Column gamma: gamma-value
Column *: *-value
Column P*R*G*: individual *-values for P, rho, and gamma
Column parse: highest-scoring assignment of the given site (set off by
hyphens) and the seven adjacent sites upstream and downstream as A (acceptor),
D (donor), E (exon), or I (intron)
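For downstream processing, the site rows can be split into typed fields. The following is a minimal sketch that assumes the whitespace-separated column layout shown in the example above; the regular expression and the function names are illustrative and not part of SplicePredictor.

```python
import re

# One site row: type, arrow, location, sequence, P, c, rho, gamma,
# combined *-value, the three individual *-values, and the parse string.
SITE_LINE = re.compile(
    r"^(?P<type>[DA])\s+(?P<arrow><-+|-+>)\s+(?P<loc>\d+)\s+"
    r"(?P<seq>[A-Za-z]+)\s+(?P<P>[\d.]+)\s+(?P<c>-?[\d.]+)\s+"
    r"(?P<rho>[\d.]+)\s+(?P<gamma>[\d.]+)\s+(?P<star>\d+)\s+"
    r"\((?P<p_star>\d)\s+(?P<r_star>\d)\s+(?P<g_star>\d)\)\s+"
    r"(?P<parse>\S+)\s*$"
)

# Lower bound of each *-value class, as listed for column q.
QUALITY = ((14, "highly likely"), (11, "likely"), (8, "possible"),
           (5, "uncertain"), (3, "doubtful"))


def parse_site_line(line):
    """Return a dict of typed fields, or None if the line is not a site row."""
    match = SITE_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "type": match["type"],
        "dashes": match["arrow"].count("-"),
        "loc": int(match["loc"]),
        "sequence": match["seq"],
        "P": float(match["P"]),
        "c": float(match["c"]),
        "rho": float(match["rho"]),
        "gamma": float(match["gamma"]),
        "star": int(match["star"]),
        "grades": tuple(int(match[k]) for k in ("p_star", "r_star", "g_star")),
        "parse": match["parse"],
    }


def quality_label(star):
    """Translate a combined *-value (3..15) into its quality class."""
    for lower_bound, label in QUALITY:
        if star >= lower_bound:
            return label
    raise ValueError("combined *-value must be at least 3")
```

Header lines and the dotted separator lines do not match the pattern, so parse_site_line returns None for them and they can simply be skipped.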
Spliced alignment - EST (Reference 4)
For each EST, the predicted gene structure based on an optimal spliced
alignment is displayed.
Please note that such an alignment is always defined and will be displayed.
However, if the input EST is unrelated to the genomic DNA, the displayed
alignment will be meaningless.
Usually, it will be obvious from the display whether the alignment is
meaningful or not.
The upper line gives the genomic DNA and the lower line gives the EST
sequence.
Identities are indicated by vertical bars in the center line.
Introns are indicated by dots, gaps in the exons by '_'.
Coordinates for the predicted exons and introns are given in the list
preceding the alignment.
Exons are assigned a normalized similarity score (1.000 represents 100%
identity).
For introns, the list gives the P-values of the donor and acceptor sites as
well as a similarity score (s) based on the sequence similarity in the
adjacent 50 bases of exon.
Spliced alignment - target protein
(Reference 5)
The display is essentially as for spliced alignments of ESTs.
Sequence similarity is based on amino acid substitution scores (BLOSUM62
matrix).
The center line in the alignment gives a vertical bar for identities,
'+' for positively scoring (conservative) replacements, and '.' for zero
scoring (weakly conservative) replacements.
References
1. Xing, L. and Brendel, V.
Species-specific splice site recognition by sequence inspection
using Bayesian statistical models.
Nucl. Acids Res., submitted March 28, 2003.
2. Kleffe, J., Hermann, K., Vahrson, W., Wittig, B. and Brendel, V.
Logitlinear models for the prediction of splice sites in plant pre-mRNA
sequences.
Nucl. Acids Res. 24(23), 4709-4718 (1996)
3. Brendel, V. and Kleffe, J.
Prediction of locally optimal splice sites in plant pre-mRNA with
applications to gene identification in Arabidopsis thaliana genomic
DNA.
Nucl. Acids Res. 26(20), 4748-4757 (1998)
4. Usuka, J., Zhu, W. and Brendel, V.
Optimal spliced alignment of homologous cDNA to a genomic DNA template.
Bioinformatics 16(3), 203-211 (2000)
5. Usuka, J. and Brendel, V.
Optimal spliced alignment of homologous proteins to a genomic DNA template.
J. Mol. Biol. 297(5), 1075-1085 (2000)