Finding CpG Islands using MAVG

MAVG program

MAVG is a software tool for finding K non-overlapping maximum-average segments of length at least L in a given sequence of numbers, for any K > 0 and L > 0 (Lin et al., 2003). CpG islands of a genomic sequence (Gardiner-Garden and Frommer, 1987) are computed with the MAVG program as follows. The input genomic sequence is converted into a sequence of numbers using a dinucleotide table (Durbin et al., 1998). For each of the 16 possible dinucleotides, the table gives the log-likelihood ratio of the dinucleotide's frequency in CpG islands to its frequency in non-CpG regions. The average score of a segment of the number sequence is the sum of the numbers in the segment divided by the length of the segment. MAVG is then run on the number sequence, and the segments with the highest average scores are reported as candidate CpG islands.
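
The conversion and the average score described above can be sketched in a few lines of Python. This is only an illustration: the names TOY_TABLE, dinucleotide_scores and segment_average are hypothetical, and the table values are placeholders chosen for readability, not the log-likelihood ratios from Durbin et al. (1998).

```python
# Placeholder table: NOT the values from Durbin et al. (1998).
# Every dinucleotide gets a small penalty, except CG, which gets a bonus.
TOY_TABLE = {a + b: -0.25 for a in "ACGT" for b in "ACGT"}
TOY_TABLE["CG"] = 1.0


def dinucleotide_scores(sequence, table):
    """Map each overlapping dinucleotide of the sequence to its table score.

    A sequence of n bases yields n - 1 numbers. Characters outside
    A, C, G, T (for example N) raise KeyError in this sketch.
    """
    sequence = sequence.upper()
    return [table[sequence[i:i + 2]] for i in range(len(sequence) - 1)]


def segment_average(scores, start, end):
    """Average score of scores[start:end]: the sum divided by the length."""
    return sum(scores[start:end]) / (end - start)


scores = dinucleotide_scores("ACGT", TOY_TABLE)  # dinucleotides AC, CG, GT
print(scores)
print(segment_average(scores, 0, 2))
```

With a real table, the same two steps would produce the number sequence that MAVG searches.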

Input to MAVG

Input sequences file format

MAVG takes as input a sequence file in FASTA format.

FASTA Format:

The first line begins with the symbol '>' followed by the name of the sequence. The sequence itself occupies the remaining lines. It must not contain blanks, and it may be in upper or lower case. Below is an example sequence in FASTA format:
>DNA sequence
GCCCCCGGCCCCGCCCCGGCCCCGCCCCCGGCCCCGCCCCGCAAGGGTC
ACAGGTCACGGGGCGGGGCCGAGGCGGAAGCGCCCGCAGCCCGGTACCG
CTCCTCCTGGGCTCCCTCTAGCGCCTTCCCCCCGGCCCGACTCCGCTGG
CAGCGCCAAGTGACTTACGCCCCCGACCTCTGAGCCCGGACCGCTAGGC
GGAGGATCAGATCTCGCTCGAGAATCTGAAGGTGCCCTGGTCCTGGAGG
AGTTCCGTCCCAGCCCGCGGTCTCCCGGTACTGTCGGGCCCCGGCCCTC
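
As a rough check that a file follows the rules above before it is submitted, a reader could use something like the following Python sketch. The function name parse_fasta is hypothetical, and the sketch assumes a single-sequence file; it is not the parser used by the MAVG server.

```python
def parse_fasta(text):
    """Return (name, sequence) from the text of a single-sequence FASTA file.

    Checks the rules stated above: the first line starts with '>',
    and the sequence lines contain no blanks. Case is normalised to upper.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("first line must begin with '>'")
    name = lines[0][1:].strip()
    body = lines[1:]
    # Leading and trailing whitespace was stripped above, so any blank
    # left here sits inside the sequence itself.
    if any(" " in line or "\t" in line for line in body):
        raise ValueError("sequence must not contain blanks")
    return name, "".join(body).upper()


example = ">DNA sequence\nGCCCCCGGcc\nacaggtcacg\n"
name, sequence = parse_fasta(example)
print(name, sequence)
```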

Parameters of MAVG

The parameter K in MAVG should be set sufficiently large so that the K best regions reported by MAVG contain at least one region whose average score is less than the chosen average-score cutoff for CpG islands. This guarantees that no region with average score above the cutoff is missed.

The parameter L should be set to the minimum length of CpG islands.
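
The rule for choosing K can be illustrated with the Python sketch below. It assumes that the K segments are found greedily (take the best segment, remove it, repeat on what is left) and uses a brute-force search, so it is a toy stand-in and not the efficient algorithm implemented in MAVG. All function names are hypothetical.

```python
def best_segment(scores, lo, hi, min_len):
    """Brute-force best-average segment of length >= min_len in scores[lo:hi].

    Returns (average, start, end) with end exclusive, or None if the
    interval is shorter than min_len.
    """
    prefix = [0.0]
    for value in scores[lo:hi]:
        prefix.append(prefix[-1] + value)
    n = hi - lo
    best = None
    for i in range(n - min_len + 1):
        for j in range(i + min_len, n + 1):
            avg = (prefix[j] - prefix[i]) / (j - i)
            if best is None or avg > best[0]:
                best = (avg, lo + i, lo + j)
    return best


def top_k_segments(scores, k, min_len):
    """Greedy stand-in for MAVG: up to k non-overlapping segments, best first."""
    free = [(0, len(scores))]  # intervals not yet covered by a chosen segment
    found = []
    for _ in range(k):
        candidates = []
        for idx, (lo, hi) in enumerate(free):
            seg = best_segment(scores, lo, hi, min_len)
            if seg is not None:
                candidates.append((seg, idx))
        if not candidates:
            break
        (avg, start, end), idx = max(candidates)
        lo, hi = free.pop(idx)
        free += [(lo, start), (end, hi)]
        found.append((avg, start, end))
    return found


def regions_above_cutoff(scores, min_len, cutoff):
    """Double K until a reported region falls below the cutoff (or none remain)."""
    k = 1
    while True:
        segments = top_k_segments(scores, k, min_len)
        if len(segments) < k or min(s[0] for s in segments) < cutoff:
            return [s for s in segments if s[0] >= cutoff]
        k *= 2


toy_scores = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
print(regions_above_cutoff(toy_scores, min_len=2, cutoff=0.5))
```

The loop stops as soon as the reported list contains a region below the cutoff, which is the condition stated above for K being large enough; here L corresponds to min_len.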

This MAVG web server was constructed by Liang Ye.

References

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998)
Biological sequence analysis. Cambridge Univ. Press.

Gardiner-Garden, M. and Frommer, M. (1987)
CpG islands in vertebrate genomes. J. Mol. Biol.,
196, 261-282.

Lin, Y.-L., Huang, X., Jiang, T. and Chao, K.-M. (2003)
MAVG: Locating Non-Overlapping Maximum-Average Segments in a Given Sequence.
Bioinformatics, 19, 151-152.