BCB @ ISU

 

MuSeqBox

 

Download

 

Help

 

Tutorial

 

References

 

Contact

 

 

MuSeqBox Description

MuSeqBox is a program for examining multi-query sequence BLAST output. It parses the BLAST output, extracts the informative parameters of the BLAST hits, and saves them in tabular form in either text or HTML format. The hit tables can optionally be analyzed further to produce subsets of BLAST hits according to user-specified criteria. In particular, BLASTX output may be analyzed to flag queries that are potentially alternatively spliced transcripts (e.g., containing an extra large inserted or deleted segment), that are potentially full-length coding sequences, or that contain repeat structures. Users of the program should cite reference.

Parameter field specifications

The output from any MuSeqBox query represents potential candidate structures only. These candidates should be confirmed by global sequence alignment and by laboratory experiments.

 

Select queries using user-set criteria:    Users may specify their own criteria for selecting queries with BLAST hits. A query is selected if any of its HSPs (high-scoring segment pairs) falls within the specified numerical ranges for: 

QLen—query sequence length

HLen—HSP length

SLen—database sequence length

CovS—percent coverage of the subject sequence, i.e., HLen relative to SLen 

CovQ—percent coverage of the query sequence, i.e., HLen relative to QLen

pid—percent identity in the HSP

Gaps—number of gap symbols in the HSP for gapped alignment

Score—alignment bit score

Eval—BLAST search expectation value

Criteria are specified with the comparison operators >, >=, <, and <=. Multiple criteria can be set by checking the corresponding checkboxes, and they are combined with a logical AND. For example, to select queries with a sequence length of at least 600 nt and a BLAST expectation value less than 1e-10, the user checks the QLen and Eval checkboxes, chooses >= for QLen and < for Eval, and enters 600 and 1e-10 in the respective fields.
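The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MuSeqBox's own code: the record layout (a dict per HSP), the function names hsp_passes and select_queries, and the sample values are all hypothetical.

```python
import operator

# The four comparison operators named in the text, mapped to Python functions.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def hsp_passes(hsp, criteria):
    """True if a single HSP record satisfies every (field, op, value) criterion."""
    return all(OPS[op](hsp[field], value) for field, op, value in criteria)

def select_queries(hsps, criteria):
    """IDs of queries with at least one HSP meeting all criteria (logical AND)."""
    selected = []
    for hsp in hsps:
        if hsp_passes(hsp, criteria) and hsp["QueryID"] not in selected:
            selected.append(hsp["QueryID"])
    return selected

# Hypothetical HSP records; field names follow the parameter list above.
hsps = [
    {"QueryID": "q1", "QLen": 750, "Eval": 1e-30},
    {"QueryID": "q2", "QLen": 420, "Eval": 1e-50},
    {"QueryID": "q3", "QLen": 900, "Eval": 1e-5},
]

# QLen at least 600 nt AND expectation value below 1e-10.
criteria = [("QLen", ">=", 600), ("Eval", "<", 1e-10)]
print(select_queries(hsps, criteria))
```

Only q1 satisfies both conditions: q2 is too short and q3 has too large an expectation value.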

 

Select queries that are globally highly similar to matching protein subjects:   This selection requires three parameters:

pid—minimal percent identity in each HSP

mao—maximal allowed overlap at either end of the selected HSPs

scv—cumulative percent coverage of the matched subject sequence (i.e., the sum of CovS over all selected HSPs that are non-overlapping or overlap by no more than mao)

 

Select queries that potentially encode full-length coding sequences:    This selection requires six parameters:

v5s—maximal variation of the starting position of the most N-terminal HSP in the protein subject

v3s—maximal variation of the ending position of the most C-terminal HSP in the protein subject relative to SLen

v5q—maximal variation of the 5' end of the query

v3q—maximal variation of the ending position of the 3' end relative to QLen

scv—cumulative percent coverage of the matched subject sequence (i.e., the sum of CovS for all non-overlapping HSPs)

qcv—cumulative percent coverage of the query sequence (i.e., the sum of CovQ for all non-overlapping HSPs)
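The six parameters can be read as six independent tests that must all pass. The sketch below shows one possible reading, and several points are assumptions rather than documented behavior: that "variation" means distance from the sequence end, that the HSPs passed in are already non-overlapping, and the function name looks_full_length together with its record layout.

```python
def looks_full_length(hsps, qlen, slen, v5s, v3s, v5q, v3q, scv, qcv):
    """Apply the six full-length tests to the non-overlapping HSPs of one
    query/subject pair. Subject coordinates are in residues, query
    coordinates in nucleotides (assumed interpretation of the parameters)."""
    first = min(hsps, key=lambda h: h["Sx"])  # most N-terminal HSP
    last = max(hsps, key=lambda h: h["Sy"])   # most C-terminal HSP
    # min/max over both ends so minus-frame hits (Qx > Qy) are handled too.
    q_start = min(min(h["Qx"], h["Qy"]) for h in hsps)
    q_end = max(max(h["Qx"], h["Qy"]) for h in hsps)
    return (
        first["Sx"] - 1 <= v5s
        and slen - last["Sy"] <= v3s
        and q_start - 1 <= v5q
        and qlen - q_end <= v3q
        and sum(h["CovS"] for h in hsps) >= scv
        and sum(h["CovQ"] for h in hsps) >= qcv
    )

# Hypothetical single-HSP hits of a 950 nt query against a 303 aa protein.
complete = [{"Qx": 20, "Qy": 928, "Sx": 1, "Sy": 303, "CovS": 100.0, "CovQ": 95.7}]
truncated = [{"Qx": 20, "Qy": 778, "Sx": 51, "Sy": 303, "CovS": 83.5, "CovQ": 79.9}]

limits = dict(v5s=5, v3s=5, v5q=30, v3q=30, scv=90, qcv=90)
print(looks_full_length(complete, 950, 303, **limits))
print(looks_full_length(truncated, 950, 303, **limits))
```

The second hit starts at subject position 51, so it fails the v5s test and would not be reported as a full-length candidate.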

 

 

 

 

 

Select queries that represent potential alternatively spliced transcripts:    This selection requires two parameters:

indel—the minimal size of the inserted or deleted sequence segment

type—whether to look for an extra insertion in the query (unmatched residues in the query between HSPs that are contiguous in the protein subject) or an extra deletion from the query (unmatched residues in the subject between HSPs that are contiguous in the query sequence)

Note: When the BLAST search finds only a single HSP, a large insertion in the query may instead appear as gaps inside that HSP, because the alignment scoring system allows gaps. Users may set the Gaps criterion under the user-set criteria options to select queries that may carry such a large extra insertion.
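The insertion and deletion cases can be illustrated by comparing the gaps between consecutive HSPs on the query and on the subject. The following is a minimal sketch, assuming plus-frame hits (Qx < Qy), query coordinates in nucleotides and subject coordinates in residues. The function name find_indels and the adjacent_tol parameter are hypothetical and not MuSeqBox options.

```python
def find_indels(hsps, indel, kind, adjacent_tol=0):
    """Report unmatched segments of at least `indel` between consecutive HSPs.

    kind="insertion": unmatched query stretch between HSPs that are
                      contiguous in the subject (query coordinates returned).
    kind="deletion":  unmatched subject stretch between HSPs that are
                      contiguous in the query (subject coordinates returned).
    """
    ordered = sorted(hsps, key=lambda h: h["Qx"])
    found = []
    for a, b in zip(ordered, ordered[1:]):
        q_gap = b["Qx"] - a["Qy"] - 1  # unmatched query positions
        s_gap = b["Sx"] - a["Sy"] - 1  # unmatched subject positions
        if kind == "insertion" and q_gap >= indel and abs(s_gap) <= adjacent_tol:
            found.append((a["Qy"] + 1, b["Qx"] - 1))
        elif kind == "deletion" and s_gap >= indel and abs(q_gap) <= adjacent_tol:
            found.append((a["Sy"] + 1, b["Sx"] - 1))
    return found

# Hypothetical pair of HSPs: adjacent in the subject, 150 nt apart in the query.
hsps = [
    {"Qx": 1,   "Qy": 300, "Sx": 1,   "Sy": 100},
    {"Qx": 451, "Qy": 750, "Sx": 101, "Sy": 200},
]
print(find_indels(hsps, indel=90, kind="insertion"))
print(find_indels(hsps, indel=90, kind="deletion"))
```

For this pair, the insertion search reports query positions 301 to 450, while the deletion search finds nothing because the two HSPs leave no unmatched subject residues between them.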

 

Select queries that may contain repeats or align to database sequences containing repeats:    Such query selection will require two parameters:

rps—minimal potential repeat size (number of nucleotides or amino acids)

src—the origin of such repeats: the query sequence or the subject sequence

 

Input format:  By default, MuSeqBox parses only BLAST output in which the sequences are identified in one of the standard GenBank formats. If the sequences used in the BLAST search are identified only in FASTA format, MuSeqBox can still parse the output when NO is selected from the “Require GenBank Formatted Sequences” pull-down menu. The disadvantage is that the HTML output from MuSeqBox cannot provide working web links for sequences that are identified only in FASTA format.

 

Print format

The MuSeqBox output consists of three parts: 1) information on the selected BLAST program (e.g., BLASTX, BLASTP), the print format (e.g., Pstyle), and the query selection criteria (if any); 2) the tabulated informative parameters extracted from the BLAST hits; 3) information on the database and the BLAST parameters used for the database search.

Informative parameters extracted from the BLAST hits are:

QueryID, query identifier (GI number or ACCESSION)

SubjectID, subject identifier (GI number or ACCESSION)

QLen, query sequence length

HSP, number of high-scoring segment pairs (x/y denotes the x-th HSP of a total of y for that query)

HLen, HSP length

CovQ, HSP percent coverage of the query sequence (for the first HSP in the above example: 396/581 = 68.2%)

Qx, Qy, query sequence coordinates of the HSP

Sx, Sy, subject sequence coordinates of the HSP

SLen, subject sequence length

CovS, HSP percent coverage of the subject sequence (for the first HSP in the above example: 396/(3*303) = 43.6%)

Pid, percent identity in the HSP

Psi, percent similarity in the HSP

NGap, number of indels in the HSP

Frame, reading frame (for BLASTX, TBLASTN, and TBLASTX; replaced by Sts for BLASTN, indicating the matching strands)

Score, HSP score

Eval, expected number of HSPs at the given score level

Db, name of the subject database

Annotation, subject sequence annotation

Source, origin of the subject sequence (species)
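The two coverage figures quoted in the CovQ and CovS entries above can be reproduced with a short calculation. The sketch assumes a BLASTX hit, where HLen and QLen are in nucleotides and SLen is in amino acids, so the subject length is multiplied by 3 as in the CovS example. The function name coverage_blastx is hypothetical.

```python
def coverage_blastx(hlen, qlen, slen):
    """Percent coverage of query and subject for one BLASTX HSP.

    hlen, qlen: HSP and query lengths in nucleotides.
    slen: subject (protein) length in amino acids, scaled by 3 to nucleotides.
    """
    cov_q = 100.0 * hlen / qlen
    cov_s = 100.0 * hlen / (3 * slen)
    return round(cov_q, 1), round(cov_s, 1)

# Values from the CovQ and CovS entries: HLen 396, QLen 581, SLen 303.
print(coverage_blastx(396, 581, 303))  # (68.2, 43.6)
```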

MuSeqBox provides four print options: 1) Text only (including Db, Annotation and Source); 2) Numeric only (including QLen, HSP, HLen, CovQ, Qx, Qy, Sx, Sy, SLen, CovS, Pid, Psi, NGap, Frame, Score and Eval); 3) Condensed (including HSP, HLen, CovQ, CovS, Pid, NGap, Frame, Eval and Annotation); and 4) Detailed, which includes all of the parameters listed above. 

For the web-based output, the query identifier (QueryID) and subject identifier (SubjectID) are linked to the GenBank nucleotide database or the GenPept database, according to their types. In addition, each expectation value (Eval) is linked to GenBank's nr database so that a BLAST search can be re-submitted for the query sequence of that record.

Occasionally the default width of the columns containing the QueryID and SubjectID is too narrow. In these cases the user can correct the problem by checking the ID_length checkbox and entering the number of characters to use as the column width.

 

 

 

 
