parameter vocabulary page

CONSENSUS

CONSENSUS is a method that identifies the recognition pattern of a DNA-binding protein given only a collection of sequenced DNA fragments. Information about the position and orientation of the binding sites within the fragments is not needed. The method compares the "information content" of a large number of possible binding-site alignments to arrive at a matrix representation of the binding-site pattern.
The specificity of the protein is represented as a matrix and a consensus sequence, allowing patterns that are typical of regulatory protein-binding sites to be identified. The reliability of the method improves as the number of sequences increases, but the time required also increases linearly with the number of sequences.
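
To make the notion of "information content" concrete, here is a minimal Python sketch. It assumes information content is measured as the relative entropy of each alignment column against background letter frequencies, summed over all columns. The function name, the base-2 logarithm, and the uniform default background are illustrative assumptions; the statistical corrections that CONSENSUS applies are not reproduced here.

```python
import math
from collections import Counter


def information_content(sites, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b).

    sites      -- equal-length aligned words, e.g. ["ACGT", "ACGA"]
    background -- prior letter probabilities (assumed uniform if omitted)
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    width = len(sites[0])
    total = 0.0
    for col in range(width):
        # Count how often each letter appears in this column.
        counts = Counter(site[col] for site in sites)
        for base, n in counts.items():
            f = n / len(sites)  # observed frequency in the column
            total += f * math.log2(f / background[base])
    return total


# A perfectly conserved column contributes 2 bits under a uniform background;
# a column split evenly between two letters contributes 1 bit.
score = information_content(["AC", "AT"])
```

Under this measure, alignments with more conserved columns score higher, which is why comparing scores across candidate alignments can single out a binding-site pattern.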

 References: G.Z. Hertz and G.D. Stormo, Bioinformatics, 15, pp. 563-577 (1999)
   home page:     http://ural.wustl.edu/

consensus options setting

1. Length of the search motif (default: 10 nt): Width of the motif to search for, in nucleotides.
2. Number of matrices to save (default: 1000): The maximum number of matrices to save. In practice fewer matrices are ultimately saved, because many of the matrices initially saved are identical to each other. The number of top matrices to output is set by the user below.
3. Number of cycles if 0 or more motifs per sequence are required: Repeat the matrix-building cycle at most "integer" times, allowing each sequence to contribute zero or more motifs per matrix. By default the program finds one motif per sequence. The higher the value, the more motifs are found in the dataset, although this option does not guarantee that motifs will be extracted from every sequence. Option 3 is mutually exclusive with option 4.
4. Number of cycles if 1 or more motifs per sequence are required: Repeat the matrix-building cycle at most "integer" times, allowing each sequence to contribute one or more motifs per matrix. By default the program looks for one motif per sequence. The higher the value, the more motifs can be found in each sequence.
5. Minimum distance between found motifs: The minimum distance between the starting points of motifs within the same matrix pattern. The value must be a positive integer. The default equals the length of the search motif. If both strands are treated as a single sequence (i.e., orientation unknown), the option also sets the minimum distance between the start of a motif and the end of a motif on the complementary strand. This option can only be used in combination with option 3 or 4.
6. Number of cycles after the most significant alignment: Terminate the program "integer" cycles after the current most significant alignment is identified. By default, the program terminates only when the maximum number of matrix-building cycles is completed and the best alignment has been obtained from all sequences. If you use this option, the program stops after the top alignment matrix is obtained, so not all sequences need be involved in constructing the alignment matrix. This parameter is very effective if you are not sure that the motif exists in all sequences.
7. Use designated prior frequencies: Use the designated prior probabilities of the letters to override the observed frequencies. By default, the program uses the frequencies observed in your own sequence data as the prior probabilities of the letters. If this option is set, the letters in the background sequences around the motifs are assumed to be independent and identically distributed. For alignments containing very few sequences, the results are more accurate if the nucleotide distribution of the current dataset is used.
8. Seed with the first sequence and proceed linearly through the list: This option speeds up the program significantly, but every k-word of the first sequence is used as an initial "interesting matrix", so the result will be strongly biased toward the content of the first sequence.
9. Options for handling the complement of nucleic acid sequences (the four options in this section are mutually exclusive):
   1. ignore the complement (default)
   2. include both strands as separate sequences
   3. include both strands as a single sequence (i.e., orientation unknown)
   4. assume that pattern is symmetrical
10. Save the top progeny for each parental matrix (default): Try to save the top progeny matrices for each parental matrix. This option prevents a strong pattern found in only a subset of the sequences from overwhelming the algorithm and eliminating other potential patterns. This undesirable situation can occur when a subset of the sequences share an evolutionary relationship not common to the majority of the sequences.
11. Save the top progeny matrices regardless of parentage: Save the top progeny matrices overall, without trying to keep the top progeny of each parental matrix (the alternative to option 10).
12. Number of top matrices to print: How many of the top matrices from each cycle to print. A negative value means print all the top matrices.

This first list contains the matrices having the highest information content from each cycle, in order of decreasing statistical significance (i.e., increasing expected frequency). In general, this first list will contain the most interesting alignment.
13. Number of final matrices to print: How many of the matrices saved from the final cycle to print. No matrices are printed when option 3 or 4 is used.

The second list contains the matrices saved after the final cycle of the program, also in decreasing statistical significance order. Generally, this latter list will be useful when the user wishes each sequence to contribute exactly one word to the final alignment (i.e., when options 3 and 4 are not used).
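
The strand-handling choices listed under option 9 can be illustrated with a short Python sketch. The helper names (`reverse_complement`, `both_strands_as_separate`, `is_symmetrical`) are hypothetical and not part of CONSENSUS; the sketch only shows what the inputs look like under each choice, not how the program implements them.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read in the 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def both_strands_as_separate(seqs):
    """Sub-option 2: each strand becomes its own input sequence."""
    expanded = []
    for seq in seqs:
        expanded.append(seq)
        expanded.append(reverse_complement(seq))
    return expanded


def is_symmetrical(word):
    """Sub-option 4 concerns patterns equal to their own reverse complement."""
    return word == reverse_complement(word)


strands = both_strands_as_separate(["AACG"])  # ["AACG", "CGTT"]
```

Ignoring the complement (sub-option 1) corresponds to using the input list unchanged, which is why it is the cheapest choice.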

MEME
Multiple EM for Motif Elicitation
MEME discovers one or more motifs in a collection of DNA sequences by using the technique of expectation maximization (EM) to fit a two-component finite mixture model to the set of sequences. The algorithm estimates how many times each motif occurs in each sequence in the dataset and outputs an alignment of the occurrences of the motif. Variable-length patterns are split by MEME into two or more separate motifs. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences and description for each motif.

home page: http://meme.sdsc.edu/meme/
meme options setting
1. Length of the search motif: Length of a single motif. MEME chooses the optimal length of each motif individually using a statistical heuristic function. You can choose different limits for the minimum and maximum motif length that MEME will consider. The length of each motif that MEME reports will lie within the limits you choose.
2. Number of sites at which motif must occur: This is the total number of sites in the data set where a single motif occurs. If you have prior knowledge about the number of occurrences of the motif in your data set, limiting MEME's search in this way can increase the likelihood of MEME finding true motifs. For example, if you know that each motif is likely to occur at least 5 times but no more than 8 times in the training set, you could specify:
Minimum sites = 5
Maximum sites = 8
MEME may still find motifs with slightly fewer or more occurrences than those you specify. In the above example, if there is a motif in the training set with only 4 occurrences, MEME may still find it, but it will report 5 occurrences, one of which will be erroneous. Likewise, if a motif in the training set has 9 occurrences, MEME will probably still find it, but it will report only 8 of its occurrences. This option cannot be used when you need to find "one occurrence per sequence" or "zero or one occurrence per sequence". The default values are reported below:

Default Numbers of Sites for each Motif
(n is the number of sequences in the dataset)

type of distribution                      minimum sites   maximum sites
one occurrence per sequence               n               n
zero or one occurrence per sequence       sqrt(n)         n
any number of repetitions per sequence    sqrt(n)         min(5*n, 50)
3. Maximum number of motifs to find: MEME will look for up to this number of distinct motifs in the training set. MEME will stop when this number of motifs has been found, or when none can be found with E-value less than 100.
4. Motif distribution menu:
   1. If you choose the first option, MEME assumes that each sequence in the dataset contains exactly one occurrence of each motif. This option is the fastest and most sensitive, but the motifs returned by MEME may be "blurry" if any of the sequences is missing them.

   2. If you choose the second option, MEME assumes that each sequence may contain at most one occurrence of each motif. This option is useful when you suspect that some motifs may be missing from some of the sequences. In that case, the motifs found will be more accurate than with the first option. This option takes more computer time than the first option (about twice as much) and is slightly less sensitive to weak motifs present in all of the sequences.

   3. If you choose the third option, MEME assumes each sequence may contain any number of non-overlapping occurrences of each motif. This option is useful when you suspect that motifs repeat multiple times within a single sequence. In that case, the motifs found will be much more accurate than with either of the other options. This option can also be used to discover repeats within a single sequence. It takes much more computer time than the first option (about ten times as much) and is somewhat less sensitive than the other two options to weak motifs that do not repeat within a single sequence.
5. Stop if motif E-value is greater than user-defined value: Sets the E-value threshold (expected frequency) for the search motif. The algorithm stops if the E-value of a motif becomes higher than the defined number. Default is 0.01.
6. Number of Expectation Maximization algorithm runs for the applied dataset: The algorithm stops once the defined number of iterations is exceeded. Default is 50.
7. Motif trimming using multiple alignment: After the algorithm has created the final alignment, it extends the motif by half of its length on both sides and examines each position to see whether it would be a better fit than the present position in the alignment. This is called trimming mode, and the three options below apply to it. If you uncheck the box, this procedure is not carried out.
8. Gap opening cost for multiple alignments: If trimming mode (option 7) is used, you need to set the gap opening cost. Default is 11.
9. Gap extension cost for multiple alignments: If trimming mode (option 7) is used, you need to set the gap extension cost. Default is 1.
10. Do not count end gaps in multiple alignments: By default, the algorithm does not count end gaps when calculating the total alignment score. If you uncheck this option, it will.
11. Use complementary strand for search: By default, MEME searches for motifs on both the given DNA strand and its reverse complement. Unchecking this box causes MEME to search the given DNA strand only.
12. Force palindromes: Checking this box causes MEME to search only for DNA palindromes. This causes MEME to average the letter frequencies in corresponding motif columns together. For instance, if the width of the motif is 10, columns 1 and 10, 2 and 9, 3 and 8, etc., are averaged together. The averaging combines the frequency of A in one column with T in the other, and the frequency of C in one column with G in the other. If this box is not checked, the columns are not averaged together.
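
The column averaging described for palindromes can be sketched in Python as follows. This is only an illustration of the averaging rule stated above (column i paired with its mirror column, A paired with T, C paired with G); the function name and the list-of-dictionaries matrix layout are assumptions and do not reflect MEME's internal data structures.

```python
# Complement of each nucleotide, used to pair A with T and C with G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def force_palindrome(freqs):
    """Average each column with the complement of its mirror column.

    freqs -- list of per-column dicts mapping base -> frequency
    """
    width = len(freqs)
    averaged = []
    for i in range(width):
        mirror = freqs[width - 1 - i]  # column 1 pairs with column w, etc.
        averaged.append(
            {b: (freqs[i][b] + mirror[COMPLEMENT[b]]) / 2 for b in "ACGT"}
        )
    return averaged


matrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.5, "T": 0.5},
]
palindromic = force_palindrome(matrix)
```

After averaging, the frequency of A in the first column equals the frequency of T in the last column, so the matrix reads the same on both strands.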

GIBBS SAMPLER

gibbs sampler options setting

1. Length of the search motif (default: 10): Length of the search motif.
2. Expected number of motifs of each type (default: 10): Melina uses the Bernoulli sampler by default, which requires an initial "guesstimate" of the total number of motif elements for each motif type. A motif type here means a specific order of the given number of nucleotides.
3. Near optimal cutoff (default: 50%): The near optimal cutoff specifies the threshold (%) for selecting "good" motifs during near optimal sampling, i.e. the percentage of iterations in which a motif must be sampled to count as "good". Near optimal sampling is the process of discarding pseudomotifs after the maximal alignment matrix has been created. All positions in the maximal alignment are marked as "good", and all other positions are sampled and compared against the threshold until the maximum number of iterations is reached. As a result, some weakly conserved motifs can be sampled out and some missed motifs can be sampled in.
4. Give seed for random number generator (default: 1000): This number seeds the random number generator. The generator produces the numbers that determine the motif start positions used to build the initial motif alignment in each sampling run.
5. Maximum number of sampling runs (number of seeds) (default: 10): The number of models to sample (one model per run), i.e. the number of seeds used by the random number generator.
6. Number of iterations between successive local maxima (Plateau Period) (default: 20): After ten sampling steps have been performed for each sequence in the dataset, the program calculates the posterior probability of the obtained alignment, called the MAP (maximum a posteriori) value. There are four parts to this probability calculation: the motif portion, the background portion, the fragmentation portion and the distribution of the number of motif sites. The plateau period iterations then start. The MAP value is calculated at every plateau period iteration, and if it does not improve during 20 (default) iterations, the alignment is assumed to be stuck in an "energy well" and will not be improved further. If an improvement is observed, sampling continues until the maximum number of iterations is reached. The resulting alignment is reported as the maximal alignment.
7. Maximum number of iterations during one sampling run (default: 500): This parameter specifies the maximum number of allowed iterations for each seed as well as the total iterations for Near Optimal sampling.
8. Pseudo count weights (default: 0.1): Background pseudocounts are calculated as a percentage of the observed background counts. The user sets how much weight the pseudocounts carry by specifying the pseudocount weight, a value between 0 and 1 with a default of 0.1 (10% of the observed counts).
9. Pseudo site weight (default: 0.8): When specifying the number of sites to be found, you can also give prior probabilities for each possible number of sites per sequence, i.e. the probability of finding 0, 1, 2, 3, etc. sites per sequence.
Enter >BLOCKS followed by a list of prior probabilities, one for each number of sites. For example, if the number of sites is 3,
>BLOCKS 0.05 0.35 0.35 0.25
assigns 5% probability to 0 sites per sequence, 35% to 1 site, and so on. The values are normalized, so they do not need to add up to 1.
10. DO NOT use fragmentation (default: off): An alignment of motifs may yield a MAP value that is not quite maximal, giving a good but not great alignment.
Fragmentation allows the sampler to look at the current alignment and the surrounding positions to find which columns should be used in the motif, rather than allowing only contiguous motifs. The width of the field that can be checked is 5 times the width of the current motif type. However, the field is generally much narrower in practice, because overlapping ends of sequences or other motifs limit the flexibility of individual motif elements. Fragmentation goes through the columns that are currently used and picks out the worst column.
11. Use element order in probabilities (site sampler) (default: off): Each sequence must contain the same number of elements arranged in the same order.
12. Shuffle the input sequences randomly (default: off): Shuffles the sequences randomly in order to remove any sequence specificity.
13. DON'T remove protein low complexity regions (default: off): Low-complexity regions are removed unless this option is checked.

14. Output Wilcoxon rank test information for motif sampler (default: off): To assess the significance of a particular alignment, the program can perform a Wilcoxon signed-rank test. Each sequence is shuffled and the collection of shuffled sequences is appended to the input sequences. Motif sampling is performed on the combined dataset, and a Wilcoxon signed-rank test is performed treating the aligned motifs as coming from two paired samples. See Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1995) for details.
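
The normalization of the >BLOCKS priors mentioned under option 9 amounts to dividing each value by the sum of all values. The Python sketch below shows this arithmetic; the function name is hypothetical and the sampler's own parsing of the >BLOCKS line is not reproduced.

```python
def normalize_site_priors(weights):
    """Scale non-negative weights so that they sum to 1.

    weights[k] is the relative weight given to k sites per sequence.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one prior weight must be positive")
    return [w / total for w in weights]


# Weights 1:7:7:5 express the same priors as 0.05 0.35 0.35 0.25.
priors = normalize_site_priors([1, 7, 7, 5])
```

Because only the ratios matter, the priors can be entered as convenient relative weights rather than as exact probabilities.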
