The Gibbs sampler stochastically examines candidate alignments in an effort to find the best alignment as measured by the maximum a posteriori (MAP) log-likelihood ratio. This algorithm finds an optimized local alignment model for N sequences in time linear in N, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats.
The Gibbs sampler usually operates in one of two modes: Bernoulli sampler or Site sampler. The Bernoulli sampler takes an initial "guesstimate" of the number of elements of each motif type, but assumes nothing about how the elements are distributed through the dataset. With the Site sampler, each sequence must contain one motif. Melina uses the Bernoulli sampler by default.
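The Site sampler mode described above can be illustrated with a minimal sketch. This is a toy version written for illustration only, not the code used by Melina or by the Gibbs sampler program: the function name `site_sampler`, the uniform background of 0.25 per base, and the way pseudocounts are added are all assumptions.

```python
import random

ALPHABET = "ACGT"

def site_sampler(seqs, width, iterations=200, seed=1000, pseudo=0.1):
    """Toy site sampler: exactly one motif site per sequence (hypothetical helper)."""
    rng = random.Random(seed)
    # Random initial start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Position-specific counts built from all sequences except sequence i.
            counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j == i:
                    continue
                for k in range(width):
                    counts[k][other[starts[j] + k]] += 1
            total = len(seqs) - 1 + pseudo * len(ALPHABET)
            # Weight each candidate start by its likelihood ratio against
            # an assumed uniform background (0.25 per base).
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= (counts[k][seq[p + k]] / total) / 0.25
                weights.append(w)
            # Draw the new start position in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Calling `site_sampler(["ACGTACGTAC", "TTACGTTTGA", "GGGACGTCCC"], 4)` returns one start position per sequence. Because the procedure is stochastic, different seeds may give different alignments, which is why several sampling runs are normally compared.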
References: Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C., Science, 262, pp.208-214, (1993)
home pages: http://bayesweb.wadsworth.org/gibbs/
abstract: available here
Gibbs sampler options setting
1. Length of the search motif (default: 10)
Length of the search motif
2. Expected number of motifs of each type (default: 10)
The Bernoulli sampler is the default in Melina, so this parameter must be supplied as an initial "guesstimate" of the total number of motif elements for each motif type. A motif type here means a specific ordering of a given number of nucleotides.
3. Near optimal cutoff (default: 50%)
The Near Optimal Cutoff specifies the threshold value (%) for selecting "good" motifs during Near Optimal sampling. It defines the percentage of iterations in which a motif must be sampled to count as "good". Near Optimal sampling is a process of discarding pseudomotifs after the maximal alignment matrix has been created. All positions in the maximal alignment are marked as "good"; all other positions are sampled and compared with the "good" positions, using the cutoff as the threshold, until the Max Iterations number is reached. As a result, some weakly conserved motifs can be sampled out and some missed motifs can be sampled in.
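As a rough illustration of how such a cutoff could be applied, the sketch below keeps the sites that were sampled in at least the given percentage of iterations. The function name `near_optimal_sites` and the representation of a site as a (sequence name, start position) pair are assumptions made for this example, not part of the program.

```python
def near_optimal_sites(sample_counts, iterations, cutoff_percent=50.0):
    """Return the sites sampled in at least cutoff_percent of the iterations.

    sample_counts maps a site (hypothetically a (sequence, start) pair)
    to the number of iterations in which that site was sampled.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return sorted(
        site
        for site, count in sample_counts.items()
        if 100.0 * count / iterations >= cutoff_percent
    )
```

With the default cutoff of 50%, a site sampled in 30 out of 100 iterations would be dropped, while one sampled in 50 or more would be kept.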
4. Give seed for random number generator (default: 1000)
This number is used as the seed for the random number generator. The generator produces the numbers that determine the motif start positions used to create the initial motif alignment in each sampling run.
5. Maximum number of sampling runs (number of seeds) (default: 10)
This value is the number of models to be sampled (one model per run), i.e. the number of seeds used by the random number generator.
6. Number of iterations between successive local maxima (Plateau Period) (default: 20)
After ten sampling steps have been performed for each sequence in the dataset, the program calculates the posterior probability of the resulting alignment, which is called the MAP (maximum a posteriori) value. There are four parts to this probability calculation: the motif portion, the background portion, the fragmentation portion and the distribution of the number of motif sites. The Plateau Period iterations then begin. The MAP value is calculated at every Plateau Period iteration, and if no improvement in the MAP value is observed during 20 (default) iterations, the alignment is assumed to be stuck in an "energy well" and will not improve. If an improvement is observed, sampling continues until the Max Iterations number is reached. The resulting alignment is reported as the maximal alignment.
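The stopping rule described here can be sketched as a small loop. This is only an illustration under stated assumptions: `step` is a hypothetical callable that performs one sampling iteration and returns the current MAP value, and the function name `run_until_plateau` is invented for the example.

```python
def run_until_plateau(step, plateau_period=20, max_iterations=500):
    """Iterate until the MAP value stops improving or the iteration cap is hit.

    step: hypothetical callable that runs one sampling iteration and
          returns the MAP value of the current alignment.
    """
    best = float("-inf")
    since_best = 0      # iterations since the last improvement
    iterations = 0
    while iterations < max_iterations and since_best < plateau_period:
        value = step()
        iterations += 1
        if value > best:
            best, since_best = value, 0   # improvement: restart the plateau count
        else:
            since_best += 1
    return best, iterations
```

With the defaults, a run ends either after 20 iterations without improvement or after 500 iterations in total, whichever comes first.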
7. Maximum number of iterations during one sampling run (default: 500)
This parameter specifies the maximum number of allowed iterations for each seed as well as the total iterations for Near Optimal sampling.
8. Pseudo count weights (default: 0.1)
Background pseudocounts are calculated as a percentage of the observed background counts. The user determines how much weight the pseudocounts hold by specifying the pseudocount weight. The weight must lie between 0 and 1, with a default of 0.1 (10% of the observed counts).
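As a small worked example of this rule, the sketch below scales observed background counts by the pseudocount weight. The function name `background_pseudocounts` and the dictionary layout are assumptions for illustration.

```python
def background_pseudocounts(observed, weight=0.1):
    """Scale observed background counts by the pseudocount weight (0 to 1)."""
    if not 0 <= weight <= 1:
        raise ValueError("weight must be between 0 and 1")
    return {base: weight * count for base, count in observed.items()}
```

For observed counts of A=300, C=200, G=200, T=300 and the default weight of 0.1, the pseudocounts would be roughly 30, 20, 20 and 30.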
9. Pseudo site weight (default: 0.8)
When specifying the number of sites to be found, it is possible to specify prior probabilities for each possible number of sites per sequence, i.e. the probability of finding 0, 1, 2, 3, etc. sites per sequence.
Enter >BLOCKS followed by a list of prior probabilities for each number of sites. For example, if the number of sites is 3,
>BLOCKS
0.05 0.35 0.35 0.25
assigns 5% probability to 0 sites per sequence, 35% to 1 site etc. The values will be normalized, so it is not necessary that they add to 1.
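The normalization mentioned above amounts to dividing each value by the sum of all values. A minimal sketch, with the hypothetical helper name `normalize_site_priors`:

```python
def normalize_site_priors(values):
    """Rescale prior weights for 0, 1, 2, ... sites so that they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("prior weights must have a positive sum")
    return [v / total for v in values]
```

Entering `5 35 35 25` would therefore be equivalent to entering `0.05 0.35 0.35 0.25`.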
10. DO NOT use fragmentation (default: off)
It is possible that an alignment of motifs results in a MAP value that is not quite the maximal value, giving a good but not great alignment.
Here fragmentation allows the sampler to look at the current alignment and the surrounding positions to find which columns should be used in the motif rather than just allowing for motifs that are continuous. The width of the field that can be checked is equal to 5 times the width of the current motif type. However, the field is generally greatly narrowed due to the limited flexibility of individual motif elements caused by overlapping ends of sequences or other motifs. Fragmentation goes through the columns that are currently used and picks out the worst column.
11. Use element order in probabilities (site sampler) (default: off)
Each sequence must contain the same number of elements arranged in the same order.
12. Undertakes random shuffling of input sequences (default: off)
Shuffles the sequences randomly in order to remove any sequence specificity.
13. DON'T remove protein low complexity regions (default: off)
Low complexity regions are removed unless this option is checked.
CORESEARCH
CORESEARCH is a program for identifying potential functional elements, such as protein binding sites, in DNA sequences solely from nucleotide sequence data. The algorithm is based on a search for n-tuples (n being the number of motif elements) that occur in at least a minimum percentage of the sequences with no mismatch or one mismatch, which may be at any position of the motif. In contrast to functional motifs, random motifs show no preferred pattern of mismatch locations within the motif or in the conservation extending beyond the motif. Selection is carried out by maximizing the information content first for the n-tuple motif, then for a region containing the motif and finally for the complete binding site.
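The first step, the tuple search, can be sketched as a brute-force scan. This is an illustrative simplification and not the CORESEARCH implementation: the function names `hamming` and `find_core_tuples` are invented here, and the later selection by information content is not shown.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_core_tuples(seqs, n=7, min_fraction=0.9, max_mismatch=1):
    """Return n-tuples found, with at most max_mismatch mismatches,
    in at least min_fraction of the sequences (hypothetical helper)."""
    # Every n-tuple that occurs anywhere in the input is a candidate.
    candidates = {s[i:i + n] for s in seqs for i in range(len(s) - n + 1)}
    hits = {}
    for tup in candidates:
        # Count the sequences containing at least one acceptable match.
        support = sum(
            any(hamming(tup, s[i:i + n]) <= max_mismatch
                for i in range(len(s) - n + 1))
            for s in seqs
        )
        if support / len(seqs) >= min_fraction:
            hits[tup] = support
    return hits
```

For the sequences `TTGACGTCATT`, `CCGACGTCAGG` and `AAGACCTCAAA`, the tuple `GACGTCA` is reported because it occurs exactly in the first two and with one mismatch in the third.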
References: Wolfertstetter, F., Frech, K., Herrmann, G., and Werner, T., CABIOS, 12, pp.71-81, (1996)
home pages: http://www.gsf.de/biodv/coresearch.html
abstract: available here
CORESEARCH options setting
- 1. Length of the search motif
- Length of the search motif. The default is 7 nt.
- 2. Conservation of sequences in which a motif must occur
- Percentage of the total number of sequences in which the motif must occur. The default is 90%.
- 3. Number of highly conserved positions and allowed distances permitted in a motif
- Number of highly conserved positions the motif must have, followed by the allowed distances. A ? mark represents all possible distances.
Example: 3|2,3|1,2,3 means that there must be 3 highly conserved positions. The distance between the first and the second conserved nucleotides must be 2 or 3, and the distance between the second and third must be 1, 2 or 3. Usually an uneven distribution of mismatch locations is preferable, so this option makes it possible to design 1MM2MM3.
- 4. Percentage from the total number of sequences in which motifs with highly conserved positions must occur
- This parameter is closely related to the previous one. It sets the percentage of sequences in which the motifs designed above must be found. The default is 90%.
- 5. Maximum number of motif sets for optimal calculation method
- CORESEARCH constructs motif sets for each basic motif. Motif sets are collected exhaustively by combining each match in each sequence with each match in the other sequences, in order to determine the motif set with the highest consensus index of the included motifs. The default is 100 motif sets.
- 6. Minimum motif set conservation to select a motif
- To construct the best motif sets, CORESEARCH matches every newly found motif against the subset of motifs. If the conservation drops below the user-defined threshold, the motif is discarded from the subset. The default conservation level is 95%.
- 7. Maximum number of the best motif sets to select for each motif
- This option determines the number of sets that should be created for each motif in order to choose the best one among them. The default is 5 sets.
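The specification string used in option 3 above (for example 3|2,3|1,2,3) could be parsed along the following lines. This is a sketch under assumptions: the function name `parse_conserved_spec` is invented, and it is assumed that a ? stands in for one whole distance field.

```python
def parse_conserved_spec(spec):
    """Parse a string such as '3|2,3|1,2,3' into (positions, distance lists).

    A field of '?' is returned as None, meaning any distance is allowed
    (assumed interpretation of the ? mark).
    """
    parts = spec.split("|")
    n_positions = int(parts[0])
    distances = []
    for field in parts[1:]:
        if field.strip() == "?":
            distances.append(None)
        else:
            distances.append([int(d) for d in field.split(",")])
    # n conserved positions need n - 1 distance fields between them.
    if len(distances) != n_positions - 1:
        raise ValueError("expected %d distance fields" % (n_positions - 1))
    return n_positions, distances
```

Parsing `3|2,3|1,2,3` gives 3 conserved positions with allowed distances [2, 3] and [1, 2, 3], matching the example given for option 3.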
Calculation of region consensus index
Identical motifs found in the same sequence are distinguished by their positions in the sequence. Since the biologically correct positions of these identical tuples cannot be identified solely from the information inside the motif, flanking bases must be taken into account. Each position set is used for an alignment employing the weight-corrected alignment algorithm described in Frech et al. (1993). From each original set only the top-scoring combination is kept. The region consensus index is obtained for all identical motifs in combination with all possible motifs in the other sequences of the dataset. Individual positions of each set are used as anchors for the alignment, and motif members that do not match the extended consensus (up to 25 nucleotides) of all other members are rejected. This results in a complete consensus description for the binding site, which contains the relative frequency gap at each position.
- 8. Length of region left of motif sets to select for each motif.
- A left flanking region of a user-defined number of nucleotides is analyzed for similarity. The default is 30 nt.
- 9. Length of region right of motif sets to select for each motif.
- A right flanking region of a user-defined number of nucleotides is analyzed for similarity. The default is 30 nt.
- 10. Maximum number of position sets per set for optimum calculations
- This parameter defines the possible number of considered position sets for each motif set. As a default 20 position sets are assumed.
- 11. Minimum region conservation for a position set.
- Conservation of a position set for each motif set. 97.5% conservation from the maximum region conservation of its tuple is assumed as a default value.
- 12. Minimum similarity of motif with consensus motif to be treated as a consensus.
- This option defines the minimum similarity that a newly found motif must have with the consensus motif to be treated as a consensus. The default is a motif element position similarity of 0.8.
- 13. Minimum similarity of motif with consensus core string to be included in a list of possible cores
- All sequences are analyzed for motif elements whose positions are similar to the positions of the elements in the position set of the consensus motif, which is called the consensus core string. The default is an element position similarity of 0.7.