MuSeqBox Description

MuSeqBox is a program for examining BLAST output from multi-query sequence searches. It parses the BLAST output, extracts the informative parameters of the BLAST hits, and saves them in tabular form in either text or HTML format. The hit tables can optionally be analyzed further with the program to produce subsets of BLAST hits according to user-specified criteria. In particular, BLASTX output may be analyzed further to flag queries that may represent alternatively spliced transcripts (e.g., those with a large inserted or deleted segment), full-length coding sequences, or sequences containing repeat structures. Users of the program should cite reference.

Parameter field specifications

The output from any MuSeqBox query represents only potential candidate structures. These should be confirmed by global sequence alignment as well as by lab experiments.

Select queries using user-set criteria:
Users may specify their own criteria for selecting queries with BLAST hits. A query is selected if any of its HSPs (high-scoring segment pairs) falls within the numerical ranges set for:
QLen: query sequence length
HLen: HSP length
SLen: database sequence length
CovS: percent coverage of the subject sequence, i.e., HLen relative to SLen
CovQ: percent coverage of the query sequence, i.e., HLen relative to QLen
pid: percent identity in the HSP
Gaps: number of gap symbols in the HSP for gapped alignment
Score: alignment bit score
Eval: BLAST search expectation value
Criteria are specified with the comparison operators >, >=, <, and <=. Multiple specifications can be set by checking the corresponding checkboxes, and the checked criteria are combined logically.

Select queries that are globally highly similar to matching protein subjects:
This selection requires three arguments:
pid: minimal percent identity in each HSP
mao: maximal allowed overlap at either end of the selected HSPs
scv: cumulative percent coverage of the matched subject sequence (i.e., the sum of CovS over all selected HSPs that either do not overlap or overlap by no more than the maximal allowed overlap)

Select queries that potentially encode full-length coding sequences:
This selection requires six parameters:
v5s: maximal variation of the starting position of the most N-terminal HSP in the protein subject
v3s: maximal variation of the ending position of the most C-terminal HSP in the protein subject relative to SLen
v5q: maximal variation of the 5' end of the query
v3q: maximal variation of the ending position of the 3' end relative to QLen
scv: cumulative percent coverage of the matched subject sequence (i.e., the sum of CovS for all non-overlapping HSPs)
qcv: cumulative percent coverage of the query sequence (i.e., the sum of CovQ for all non-overlapping HSPs)
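
The full-length test described above can be sketched in a few lines. This is only an illustration under stated assumptions, not MuSeqBox's actual implementation: the `Hsp` record and the function `looks_full_length` are hypothetical names, the four "variation" parameters are read as maximal distances from position 1 or from the sequence end, scv and qcv are read as minimal thresholds, and the HSPs passed in are assumed to be already filtered to a non-overlapping set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hsp:
    """One HSP; a hypothetical record, not a MuSeqBox data structure."""
    qx: int       # first aligned query position (1-based)
    qy: int       # last aligned query position
    sx: int       # first aligned subject position (1-based)
    sy: int       # last aligned subject position
    cov_q: float  # CovQ, percent coverage of the query
    cov_s: float  # CovS, percent coverage of the subject


def looks_full_length(hsps: List[Hsp], qlen: int, slen: int,
                      v5s: int, v3s: int, v5q: int, v3q: int,
                      scv: float, qcv: float) -> bool:
    """Return True if the (non-overlapping) HSPs pass all six checks.

    Assumption: v5s/v3s/v5q/v3q are maximal allowed distances from the
    sequence start or end, and scv/qcv are minimal cumulative coverages.
    """
    if not hsps:
        return False
    n_term = min(hsps, key=lambda h: h.sx)  # most N-terminal HSP in subject
    c_term = max(hsps, key=lambda h: h.sy)  # most C-terminal HSP in subject
    q_start = min(min(h.qx, h.qy) for h in hsps)
    q_end = max(max(h.qx, h.qy) for h in hsps)
    checks = [
        n_term.sx - 1 <= v5s,                # subject start close to residue 1
        slen - c_term.sy <= v3s,             # subject end close to SLen
        q_start - 1 <= v5q,                  # query 5' end close to position 1
        qlen - q_end <= v3q,                 # query 3' end close to QLen
        sum(h.cov_s for h in hsps) >= scv,   # cumulative subject coverage
        sum(h.cov_q for h in hsps) >= qcv,   # cumulative query coverage
    ]
    return all(checks)


# Made-up values for illustration only.
example = [Hsp(10, 459, 1, 150, 45.0, 50.0),
           Hsp(520, 969, 151, 300, 45.0, 50.0)]
print(looks_full_length(example, qlen=1000, slen=300,
                        v5s=5, v3s=5, v5q=20, v3q=40, scv=90.0, qcv=80.0))
```

Tightening any single threshold (for instance, lowering v3q below the 31 unaligned positions at the 3' end of the example query) is enough to reject the query, which is why all six checks are combined with `all`.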

Select queries that represent potential alternatively spliced transcripts:
This selection requires two parameters:
indel: the minimal size of the sequence segment
type: either an extra insertion in the query, corresponding to unmatched residues in the query between HSPs that are contiguous in the protein subject, or an extra deletion from the query, corresponding to unmatched residues in the subject between HSPs that are contiguous in the query sequence
Note: When a BLAST search finds only a single HSP, a large insertion in the query may still be present inside that HSP, because a gapped alignment scoring system allows it. Users may set the Gaps criterion in the user-set criteria options to select queries that may contain such a large insertion.

Select queries that may contain repeats or align to database sequences containing repeats:
This selection requires two parameters:
rps: minimal potential repeat size (number of nucleotides or amino acids)
src: the origin of the repeats, either the query sequence or the subject sequence

Input format:
By default, MuSeqBox parses only BLAST output in which the sequences are identified in one of the standard GenBank formats. When the sequences that were searched are identified only in FASTA format, MuSeqBox can still parse the output if NO is selected from the “Require GenBank Formatted Sequences” pull-down menu. The disadvantage is that the HTML output from MuSeqBox cannot provide working web links for sequences that are only in FASTA format.
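
The insertion/deletion criterion used for the alternatively spliced candidates can be sketched as follows. This is a rough illustration, not the program's own code: the function `find_indels`, the tuple layout, and the `slack` parameter (how many unmatched residues still count as "contiguous") are hypothetical, and all coordinates are assumed to be 1-based, inclusive, and on the forward strand.

```python
from typing import List, Tuple

# (Qx, Qy, Sx, Sy) for one HSP; hypothetical layout for this sketch.
HspCoords = Tuple[int, int, int, int]


def find_indels(hsps: List[HspCoords], indel: int,
                slack: int = 0) -> List[Tuple[str, int]]:
    """Report large unmatched segments between consecutive HSPs.

    An "insertion" is an unmatched query segment of at least `indel`
    residues between HSPs that are contiguous in the subject; a
    "deletion" is the mirror case with query and subject swapped.
    """
    ordered = sorted(hsps)  # order HSPs by query start position
    events = []
    for (_, q1y, _, s1y), (q2x, _, s2x, _) in zip(ordered, ordered[1:]):
        q_gap = q2x - q1y - 1  # unmatched residues in the query
        s_gap = s2x - s1y - 1  # unmatched residues in the subject
        if s_gap <= slack and q_gap >= indel:
            events.append(("insertion", q_gap))
        elif q_gap <= slack and s_gap >= indel:
            events.append(("deletion", s_gap))
    return events


# Made-up coordinates: 300 unmatched query residues between two HSPs
# that abut in the subject.
print(find_indels([(1, 300, 1, 100), (601, 900, 101, 200)], indel=100))
```

For BLASTX output the query coordinates are in nucleotides and the subject coordinates in amino acids, so in practice the `indel` threshold would need to be interpreted in the units of whichever sequence carries the unmatched segment; the sketch leaves that choice to the caller.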
Print format
The MuSeqBox output consists of three parts:
1) information on the selected BLAST program (e.g., BLASTX, BLASTP, etc.), the print format (e.g., Pstyle), and the query selection criteria (if any);
2) tabulated informative parameters extracted from the BLAST hits;
3) information on the database and on the BLAST parameters used for the database search.

Informative parameters extracted from the BLAST hits are:
QueryID, query identifier (GI number or ACCESSION)
SubjectID, subject identifier (GI number or ACCESSION)
QLen, query sequence length
HSP, number of high-scoring segment pairs (x/y denotes the x-th HSP of a total of y for that query)
HLen, HSP length
CovQ, HSP percent coverage of the query sequence (for the first HSP in the above example: 396/581 = 68.2%)
Qx, Qy, query sequence coordinates of the HSP
Sx, Sy, subject sequence coordinates of the HSP
SLen, subject sequence length
CovS, HSP percent coverage of the subject sequence (for the first HSP in the above example: 396/(3*303) = 43.6%)
Pid, percent identity in the HSP
Psi, percent similarity in the HSP
NGap, number of indels in the HSP
Frame, reading frame (for BLASTX, TBLASTN, and TBLASTX; replaced by Sts for BLASTN, indicating the matching strands)
Score, HSP score
Eval, expected number of HSPs at the given score level
Db, name of the subject database
Annotation, subject sequence annotation
Source, origin of the subject sequence (species)

MuSeqBox provides four print options:
1) Text only (including Db, Annotation and Source);
2) Numeric only (including QLen, HSP, HLen, CovQ, Qx, Qy, Sx, Sy, SLen, CovS, Pid, Psi, NGap, Frame, Score and Eval);
3) Condensed (including HSP, HLen, CovQ, CovS, Pid, NGap, Frame, Eval and Annotation);
4) Detailed, which includes all of the parameters listed above.

In the web-based output, the query identifier (QueryID) and subject identifier (SubjectID) are linked to the GenBank nucleotide database or the GenPept database, according to their types. In addition, the expectation values (Eval) are linked to GenBank's nr database so that a BLAST search can be resubmitted for the query sequence of that record. Occasionally the default width of the columns containing QueryID and SubjectID is too small. In that case the user can correct the problem by checking the ID_length checkbox and providing a new number of characters to use as the column width.
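
The two coverage figures quoted in the field list (CovQ and CovS) can be reproduced with a short calculation. The sketch below is illustrative only: `percent_coverage` and its `units_per_residue` argument are hypothetical names, and reading the factor of 3 in the CovS arithmetic as a conversion between amino-acid and nucleotide lengths for a translated search is an assumption based on the quoted formula.

```python
def percent_coverage(hlen: int, seq_len: int, units_per_residue: int = 1) -> float:
    """Percent of a sequence covered by an HSP of length `hlen`.

    `units_per_residue` rescales `seq_len` into the units of `hlen`
    (assumed here to be 3 when HLen is in nucleotides and the subject
    length is in amino acids).
    """
    return 100.0 * hlen / (units_per_residue * seq_len)


# Values taken from the two worked figures in the field list.
cov_q = percent_coverage(396, 581)                       # 396/581
cov_s = percent_coverage(396, 303, units_per_residue=3)  # 396/(3*303)
print(f"CovQ = {cov_q:.1f}%")
print(f"CovS = {cov_s:.1f}%")
```

Keeping the unit conversion as an explicit argument makes it harder to mix nucleotide and amino-acid lengths by accident when the same helper is used for different BLAST programs.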