parameter vocabulary page


CONSENSUS

CONSENSUS is a method that identifies the recognition pattern of a DNA-binding protein given only a collection of sequenced DNA fragments. Information about the position and orientation of the binding sites within the fragments is not needed. The method compares the "information content" of a large number of possible binding-site alignments to arrive at a matrix representation of the binding-site pattern.
The specificity of the protein is represented as a matrix and a consensus sequence, allowing patterns that are typical of regulatory protein-binding sites to be identified. The reliability of the method improves as the number of sequences increases, but the time required also increases linearly with the number of sequences.
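
As a rough illustration of the "information content" that is compared across alignments, the sketch below scores a set of aligned sites against background letter frequencies. It uses a simplified definition (log base 2, uniform background unless priors are given) and assumes uppercase A/C/G/T input. The function name `information_content` and the example sites are made up for this page, and the program's own statistic and significance estimate are not reproduced here.

```python
import math

ALPHABET = "ACGT"

def information_content(sites, priors=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b) for equal-length aligned sites."""
    if priors is None:
        # Assumption for this sketch: uniform background frequencies.
        priors = {base: 0.25 for base in ALPHABET}
    width = len(sites[0])
    total = 0.0
    for col in range(width):
        column = [site[col] for site in sites]
        for base in ALPHABET:
            freq = column.count(base) / len(column)
            if freq > 0:  # a letter that never occurs contributes nothing
                total += freq * math.log2(freq / priors[base])
    return total

# Hypothetical aligned sites, one per input sequence.
example_sites = ["TTGACA", "TTGACA", "TTTACA", "TTGATA"]
print(round(information_content(example_sites), 3))
```

A perfectly conserved column scores 2 bits under a uniform background, so alignments with more conserved columns score higher.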

References: G.Z. Hertz and G.D. Stormo, Bioinformatics, 15, pp.563-577 (1999)
home pages: http://ural.wustl.edu/
consensus options setting


1. Length of the search motif (default: 10nt.)
Length of the search motif.


2. Number of matrices to save (default: 1000)
The maximum number of matrices to save. In practice fewer matrices are ultimately saved, because many of the matrices saved initially are identical to each other. The number of top matrices to output is set by the user below.


3. Number of cycles if 0 or more motifs per sequence are required
Repeat the matrix-building cycle a maximum of "integer" times and allow each sequence to contribute zero or more motifs per matrix. The default finds one motif per sequence. The higher the value of this option, the larger the number of motifs found in the dataset, although the option does not guarantee that motifs are extracted from all sequences. Option 3 is mutually exclusive with option 4.


4. Number of cycles if 1 or more motifs per sequence are required
Repeat the matrix-building cycle a maximum of "integer" times and allow each sequence to contribute one or more motifs per matrix. The default looks for one motif per sequence. The higher the value of this option, the more motifs can be found in each sequence.


5. Minimum distance between found motifs
The minimum distance between the starting points of motifs within the same matrix pattern. The value must be a positive integer. The default equals the length of the search motif. If both strands are treated as a single sequence (i.e., orientation unknown), the option also sets the minimum distance between the start of a motif and the end of a motif on the complementary strand. This option can only be used in combination with option 3 or option 4.


6. Number of cycles after the most significant alignment
Terminate the program "integer" cycles after the current most significant alignment is identified. By default, the program terminates only when the maximum number of matrix-building cycles is completed and the best alignment is obtained from all sequences. If you use this option, the program stops after the top alignment matrix is obtained, so not all sequences are necessarily involved in constructing the alignment matrix. This parameter is very effective if you are not sure that the motif exists in all sequences.


7. Use designated prior frequencies
Use the designated prior probabilities of the letters to override the observed frequencies. By default, the program uses the frequencies observed in your own sequence data as the prior probabilities of the letters. If this option is set, it is assumed that the letters in the motifs' background sequences are independent and identically distributed. For alignments containing very few sequences, the results are more accurate if the nucleotide distribution of the current dataset is used.
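
A minimal sketch of the default behaviour described above, in which the prior probabilities are the letter frequencies observed in the input data. The helper name `observed_priors` is hypothetical and the code is not taken from CONSENSUS; letters other than A, C, G and T are simply ignored here, which is an assumption.

```python
from collections import Counter

def observed_priors(sequences):
    """Return the observed frequency of each nucleotide across all sequences."""
    counts = Counter(
        base for seq in sequences for base in seq.upper() if base in "ACGT"
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T letters found in the input")
    return {base: counts[base] / total for base in "ACGT"}

# Hypothetical input fragments.
print(observed_priors(["AACG", "TT"]))
```

Designated priors would replace this dictionary with fixed values chosen by the user.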


8. Seed with the first sequence and proceed linearly through the list
Seed with the first sequence and proceed linearly through the list. This option speeds up the program significantly, but every k-word of the first sequence is used as the initial set of "interesting matrices", so the result will be strongly biased toward the content of the first sequence.

9. Options for handling the complement of nucleic acid sequences
The four options in this section are mutually exclusive.

1. ignore the complement (default)
2. include both strands as separate sequences
3. include both strands as a single sequence (i.e., orientation unknown)
4. assume that pattern is symmetrical
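
The choices above can be pictured with a short sketch. It shows only choice 1 (ignore the complement), choice 2 (both strands as separate sequences) and a check for a symmetrical word as in choice 4. The mode labels `"ignore"` and `"separate"` and all function names are hypothetical, not program flags.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def expand_strands(sequences, mode):
    """Build the search set for mode 'ignore' (choice 1) or 'separate' (choice 2)."""
    if mode == "ignore":
        return list(sequences)
    if mode == "separate":
        expanded = []
        for seq in sequences:
            expanded.extend([seq, reverse_complement(seq)])
        return expanded
    raise ValueError("unknown mode: " + mode)

def is_symmetrical(word):
    """A symmetrical pattern reads the same on both strands."""
    return word == reverse_complement(word)

print(expand_strands(["AACG"], "separate"))
print(is_symmetrical("GAATTC"))
```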

10. Save the top progeny for each parental matrix (default)

Try to save the top progeny matrices for each parental matrix. This option prevents a strong pattern found in only a subset of sequences from overwhelming the algorithm and eliminating other potential patterns. This undesirable situation can occur when a subset of the sequences share an evolutionary relationship not common to the majority of the sequences.


11. Save the top progeny matrices regardless of parentage.
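Save the top-scoring progeny matrices whatever their parental matrix; the top progeny of each parental matrix are not saved separately.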



About the OUTPUT

The program prints two different lists of matrices.

12. Number of top matrices to print

The number of top matrices from each cycle to print. A negative value prints all the top matrices.

This first list contains the matrices having the highest information content from each cycle in decreasing statistical significance order (i.e., increasing expected frequency). In general, this first list will contain the most interesting alignment.


13. Number of final matrices to print

The number of matrices saved from the final cycle to print. No matrices are printed when option 3 or 4 is used.

The second list contains the matrices saved after the final cycle of the program, also in decreasing statistical significance order. Generally, this latter list will be useful when the user wishes each sequence to contribute exactly one word to the final alignment (i.e., when options 3 and 4 are not used).








MEME

Multiple EM for Motif Elicitation

MEME discovers one or more motifs in a collection of DNA sequences by using the technique of expectation maximization (EM) to fit a two-component finite mixture model to the set of sequences. The algorithm estimates how many times each motif occurs in each sequence in the dataset and outputs an alignment of the occurrences of the motif. MEME splits variable-length patterns into two or more motifs. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences and description for each motif.
home pages: http://meme.sdsc.edu/meme/
meme options setting


1. Length of the search motif
Length of a single motif. MEME chooses the optimal length of each motif individually using a statistical heuristic function. You can choose different limits for the minimum and maximum motif length that MEME will consider. The length of each motif that MEME reports will lie within the limits you choose.

2. Number of sites at which the motif must occur
This is the total number of sites in the data set where a single motif occurs. If you have prior knowledge about the number of occurrences of the motif in your data set, limiting MEME's search in this way can increase the likelihood of MEME finding true motifs. For example, if you know that each motif is likely to occur at least 5 times but no more than 8 times in the training set, you could specify:
Minimum sites = 5
Maximum sites = 8
MEME may still find motifs with slightly fewer or more occurrences than those you specify. In the above example, if there is a motif in the training set with only 4 occurrences, MEME may still find it, but it will report 5 occurrences, one of which will be erroneous. Likewise, if a motif in the training set has 9 occurrences, MEME will probably still find it, but it will report only 8 of its occurrences. This option cannot be used when you need to find "one occurrence per sequence" or "zero or one occurrence per sequence". The default values are reported below:

Default Numbers of Sites for each Motif

type of distribution                      minimum sites   maximum sites
one occurrence per sequence               n               n
zero or one occurrence per sequence       sqrt(n)         n
any number of repetitions per sequence    sqrt(n)         min(5*n, 50)
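
The defaults in the table can be written as a small function. This is a sketch that assumes n is the number of sequences in the dataset. The labels `oops`, `zoops` and `anr` are used here as short names for the three distribution types, and `default_site_limits` is a hypothetical helper. The table does not say how sqrt(n) is rounded, so the value is returned unrounded.

```python
import math

def default_site_limits(n, distribution):
    """Return (minimum sites, maximum sites) for n sequences, following the table."""
    if distribution == "oops":    # one occurrence per sequence
        return n, n
    if distribution == "zoops":   # zero or one occurrence per sequence
        return math.sqrt(n), n
    if distribution == "anr":     # any number of repetitions per sequence
        return math.sqrt(n), min(5 * n, 50)
    raise ValueError("unknown distribution type: " + distribution)

print(default_site_limits(16, "zoops"))  # (4.0, 16)
print(default_site_limits(16, "anr"))    # (4.0, 50)
```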


3. Maximum number of motifs to find
MEME will look for up to this number of distinct motifs in the training set. MEME will stop when this number of motifs has been found, or when none can be found with E-value less than 100.


4. Motif distribution menu
1. If you choose the first option, MEME assumes that each sequence in the dataset contains exactly one occurrence of each motif. This option is the fastest and most sensitive, but the motifs returned by MEME may be "blurry" if any of the sequences lacks them.

2. If you choose the second option, MEME assumes that each sequence may contain at most one occurrence of each motif. This option is useful when you suspect that some motifs may be missing from some of the sequences. In that case, the motifs found will be more accurate than using the first option. This option takes more computer time than the first option (about twice as much) and is slightly less sensitive to weak motifs present in all of the sequences.

3. If you choose the third option, MEME assumes each sequence may contain any number of non-overlapping occurrences of each motif. This option is useful when you suspect that motifs repeat multiple times within a single sequence. In that case, the motifs found will be much more accurate than with either of the other options. This option can also be used to discover repeats within a single sequence. It takes much more computer time than the first option (about ten times as much) and is somewhat less sensitive than the other two options to weak motifs that do not repeat within a single sequence.


5. Stop if motif E-value is greater than user-defined value
Here you set the E-value threshold for the search motif. The algorithm stops if a motif's E-value becomes higher than the defined number. Default is 0.01.


6. Maximum number of Expectation Maximization iterations for the applied dataset
The algorithm stops running once the defined number of iterations is reached. Default is 50 iterations.


7. Motif trimming using multiple alignment
After the algorithm has created the final alignment, it adds half of the motif's length to both sides of the motif and examines each position to see whether it is better than the present position in the alignment. This is called trimming mode. In that case the three options below are also used. If you uncheck the box, this procedure is not carried out.


8. Gap opening cost for multiple alignments
If trimming mode (option 7 above) is used, you need to set the gap opening cost. Default is 11.

9. Gap extension cost for multiple alignments
If trimming mode (option 7 above) is used, you need to set the gap extension cost. Default is 1.

10. Do not count end gaps in multiple alignments
By default, the algorithm does not count the cost of end gaps when calculating the total alignment score. If you uncheck this option, it does.


11. Use complementary strand for search

By default, MEME searches for motifs on both the given DNA strand and the reverse complement strand. Unchecking this box causes MEME to search the given DNA strand only.


12. Force palindromes

Checking this box causes MEME to search only for DNA palindromes. This causes MEME to average the letter frequencies in corresponding motif columns together. For instance, if the width of the motif is 10, columns 1 and 10, 2 and 9, 3 and 8, etc., are averaged together. The averaging combines the frequency of A in one column with T in the other, and the frequency of C in one column with G in the other. If this box is not checked, the columns are not averaged together.
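
The column averaging described above can be sketched as follows. The matrix layout (a list of per-column dictionaries) and the name `force_palindrome` are assumptions made for this illustration and do not reflect MEME's internal data structures.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def force_palindrome(matrix):
    """Average each column with its mirror column, pairing A with T and C with G.

    matrix is a list of columns; each column maps a base to its frequency.
    """
    width = len(matrix)
    averaged = []
    for i in range(width):
        mirror = matrix[width - 1 - i]  # column 1 pairs with column w, 2 with w-1, ...
        averaged.append(
            {base: (matrix[i][base] + mirror[COMPLEMENT[base]]) / 2 for base in "ACGT"}
        )
    return averaged

# Hypothetical two-column frequency matrix.
example = [
    {"A": 0.8, "C": 0.2, "G": 0.0, "T": 0.0},
    {"A": 0.1, "C": 0.1, "G": 0.2, "T": 0.6},
]
for column in force_palindrome(example):
    print(column)
```

After averaging, the frequency of A in a column equals the frequency of T in its mirror column, which is what makes the resulting motif a palindrome.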








GIBBS sampler

The Gibbs sampler stochastically examines candidate alignments in an effort to find the best alignment as measured by the maximum a posteriori (MAP) log-likelihood ratio. This algorithm finds an optimized local alignment model for N sequences in N-linear time, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats. The Gibbs sampler usually runs in one of two modes: the Bernoulli sampler and the Site sampler. The first takes an initial "guesstimate" of the number of elements of each motif type, but says nothing about how the elements are distributed through the dataset. With the Site sampler, each sequence must contain one motif. Melina provides the Bernoulli sampler as the default.
References: Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C., Science, 262, pp.208-214, (1993)
home pages: http://bayesweb.wadsworth.org/gibbs/
gibbs sampler options setting



1. Length of the search motif (default: 10)
Length of the search motif


2. Expected number of motifs of each type (default: 10)
The Bernoulli sampler is the default in Melina, so this parameter must be supplied as an initial "guesstimate" of the total number of motif elements for each motif type. Motif type here means a specific order of an appropriate number of nucleotides.


3. Near optimal cutoff (default: 50%)
The Near Optimal Cutoff percent specifies the threshold value (%) for selecting the "good" motifs during Near Optimal sampling. It defines the percentage of iterations in which a motif must be sampled to count as "good". Near Optimal sampling is the process of discarding pseudomotifs after the maximal alignment matrix has been created. All positions in the maximal alignment are marked as "good", and all other positions are sampled and compared to the "good" positions, which serve as the threshold, until the Max Iterations number is reached. As a result, some weakly conserved motifs can be sampled out and some missed motifs can be sampled in.


4. Give seed for random number generator (default: 1000)
This number is used as the seed for the random number generator. The random number generator produces the numbers that set the motif start positions used to create the initial motif alignment in each sampling run.


5. Maximum number of sampling runs (number of seeds) (default: 10)
This value sets the number of models to be sampled (one model per run), i.e. the number of seeds used by the random number generator.



6. Number of iterations between successive local maxima (Plateau Period) (default: 20)
After ten sampling steps have been repeated for each sequence in the dataset, the program calculates the posterior probability of the alignment obtained, which is called the MAP (maximum a posteriori) value. There are four parts to this probability calculation: the motif portion, the background portion, the fragmentation portion and the distribution of the number of motif sites. Plateau Period iterations then start and run for the defined number of times. The MAP value is calculated for every Plateau Period iteration, and if no improvement in the MAP value is observed during 20 (default) iterations, it is assumed that the alignment is stuck in an "energy well" and will not improve further. If an improvement is observed, sampling continues until the Max Iterations number is reached. The alignment obtained is reported as the maximal alignment.
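
The stopping rule described above can be sketched as a loop that tracks the best MAP value seen so far. This is an illustration only: `score_step` is a hypothetical callback standing in for one sampling iteration that returns the current MAP value, the initial ten sampling steps are not modelled, and the defaults mirror options 6 and 7 on this page.

```python
def run_until_plateau(score_step, plateau_period=20, max_iterations=500):
    """Iterate until the MAP value has not improved for plateau_period iterations
    or max_iterations is reached; return (best MAP value, iterations used)."""
    best = float("-inf")
    since_improvement = 0
    iterations = 0
    while iterations < max_iterations and since_improvement < plateau_period:
        value = score_step()
        iterations += 1
        if value > best:
            best = value
            since_improvement = 0   # a new local maximum restarts the plateau count
        else:
            since_improvement += 1
    return best, iterations

# Hypothetical MAP values: three improvements, then no further progress.
values = iter([1.0, 2.0, 3.0] + [0.0] * 100)
print(run_until_plateau(lambda: next(values), plateau_period=5))  # (3.0, 8)
```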


7. Maximum number of iterations during one sampling run (default: 500)
This parameter specifies the maximum number of allowed iterations for each seed as well as the total iterations for Near Optimal sampling.


8. Pseudo count weights (default: 0.1)
Background pseudocounts are calculated as a percentage of the background observed counts. The user can determine how much weight the pseudocounts hold by specifying the pseudocount weight. The weight will be in the range between 0 and 1 with a default of 0.1 (10% of the observed counts.)


9. Pseudo site weight (default: 0.8)

When specifying the number of sites to be found, it is possible to specify prior probabilities for each possible number of sites per sequence, i.e. the probability of finding 0, 1, 2, 3, etc. sites per sequence.
Enter >BLOCKS followed by a list of prior probabilities for each number of sites. For example, if the number of sites is 3,

>BLOCKS 0.05 0.35 0.35 0.25

assigns 5% probability to 0 sites per sequence, 35% to 1 site etc. The values will be normalized, so it is not necessary that they add to 1.
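
The normalization mentioned above can be illustrated with a short sketch. The parser `parse_blocks` is hypothetical and only mimics the line format shown in the example; it is not the sampler's own input routine.

```python
def parse_blocks(line):
    """Parse a '>BLOCKS w0 w1 w2 ...' line and return normalized prior probabilities.

    Index k of the result is the prior probability of k sites per sequence.
    """
    fields = line.split()
    if not fields or fields[0] != ">BLOCKS":
        raise ValueError("line must start with >BLOCKS")
    weights = [float(field) for field in fields[1:]]
    total = sum(weights)
    if not weights or total <= 0:
        raise ValueError("at least one positive weight is required")
    return [weight / total for weight in weights]

print(parse_blocks(">BLOCKS 0.05 0.35 0.35 0.25"))
print(parse_blocks(">BLOCKS 1 7 7 5"))  # same proportions, not summing to 1 on input
```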


10. DO NOT use fragmentation (default: off)

It is possible that an alignment of motifs results in a MAP value that is not quite the maximal value and therefore gives a good but not great alignment.
Here fragmentation allows the sampler to look at the current alignment and the surrounding positions to find which columns should be used in the motif rather than just allowing for motifs that are continuous. The width of the field that can be checked is equal to 5 times the width of the current motif type. However, the field is generally greatly narrowed due to the limited flexibility of individual motif elements caused by overlapping ends of sequences or other motifs. Fragmentation goes through the columns that are currently used and picks out the worst column.


11. Use element order in probabilities (site sampler) (default: off)

Each sequence must contain the same number of elements arranged in the same order.


12. Randomly shuffle the input sequences (default: off)

Shuffle the sequences randomly in order to remove any specificity of sequence.


13. DON'T remove protein low complexity regions (default: off)

Removes the low complexity regions if the option is not checked.



14. Output Wilcoxon rank test information for motif sampler (default: off)

In order to assess the significance of a particular alignment, the program can perform a Wilcoxon signed-rank test. Each sequence is shuffled and the collection of shuffled sequences is appended to the input sequences. Motif sampling is performed on the combined data set and a Wilcoxon signed rank test is performed treating the aligned motifs as coming from two paired samples. See Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1995) for details.









CORESEARCH

CORESEARCH is a program for identifying potential functional elements, such as protein binding sites, in DNA sequences solely from nucleotide sequence data. The algorithm is based on a search for n-tuples (n being the number of motif elements) that occur in at least a minimum percentage of the sequences with at most one mismatch, which may be at any position of the motif. In contrast to functional motifs, random motifs show no preferred pattern of mismatch locations within the motif or in the conservation extended beyond the motif. Selection is carried out by maximization of the information content first for the n-tuple motif, then for a region containing the motif and finally for the complete binding site.
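
The first step, finding n-tuples present in a minimum percentage of sequences with at most one mismatch, can be sketched as below. This is a simplified illustration: it only considers tuples that appear literally somewhere in the data (an assumption), and the function names are hypothetical rather than taken from CORESEARCH.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def occurs(seq, word, max_mismatch=1):
    """True if word matches some window of seq with at most max_mismatch mismatches."""
    n = len(word)
    return any(
        hamming(seq[i:i + n], word) <= max_mismatch
        for i in range(len(seq) - n + 1)
    )

def candidate_tuples(sequences, n, min_fraction=0.9):
    """Map each n-tuple found in the data to the number of sequences containing it,
    keeping only tuples present in at least min_fraction of the sequences."""
    words = {s[i:i + n] for s in sequences for i in range(len(s) - n + 1)}
    hits = {}
    for word in words:
        count = sum(occurs(s, word) for s in sequences)
        if count / len(sequences) >= min_fraction:
            hits[word] = count
    return hits

# Hypothetical input: ACGT occurs exactly twice and once with one mismatch (ACCT).
example = ["AAACGTAA", "TTACGTTT", "GGACCTGG"]
print(sorted(candidate_tuples(example, 4, min_fraction=1.0)))
```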

References: Wolfertstetter, F., Frech, K., Herrmann, G., and Werner, T., CABIOS, 12, pp.71-81, (1996)
home pages: http://www.gsf.de/biodv/coresearch.html
coresearch options setting

1. Length of the search motif
Length of the search motif. Default is 7nt.


2. Conservation of sequences in which a motif must occur
Percentage of the total number of sequences in which the motif must occur. The default is 90%.


3. Number of highly conserved positions and allowed distances permitted in a motif
Number of highly conserved positions the motif must have, followed by the allowed distances. A "?" mark represents all possible distances.
Example: 3|2,3|1,2,3 means that there must be 3 highly conserved positions. The distance between the first and the second conserved nucleotides must be 2 or 3, and the distance between the second and third ones must be 1, 2 or 3. Usually an uneven distribution of mismatch locations is preferable, so this option provides the possibility to design 1MM2MM3
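
To make the notation concrete, the sketch below parses a specification such as 3|2,3|1,2,3 into the number of conserved positions and the allowed distances between neighbouring positions. The parser `parse_core_spec` is hypothetical, and treating a lone "?" field as "any distance" is an assumption about how the wildcard is written.

```python
def parse_core_spec(spec):
    """Parse 'N|d,d|d,d,d' into (N, list of allowed-distance lists).

    A '?' field is returned as None, meaning any distance is allowed (assumption).
    """
    fields = spec.split("|")
    positions = int(fields[0])
    distances = []
    for field in fields[1:]:
        if field == "?":
            distances.append(None)
        else:
            distances.append([int(d) for d in field.split(",")])
    if len(distances) != positions - 1:
        raise ValueError("expected one distance list between each pair of positions")
    return positions, distances

print(parse_core_spec("3|2,3|1,2,3"))  # (3, [[2, 3], [1, 2, 3]])
```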


4. Percentage from the total number of sequences in which motifs with highly conserved positions must occur
This parameter is closely related to the one above. It sets the percentage of sequences in which the motifs designed above must be found. 90% is the default value.


5. Maximum number of motif sets for optimal calculation method
CORESEARCH constructs motif sets for each basic motif. Motif sets are collected exhaustively by combining each match in each sequence with each match in the other sequences, in order to determine the motif set with the highest consensus index of the included motifs. 100 motif sets are used as the default.


6. Minimum motif set conservation to select a motif
In order to construct the best motif sets, CORESEARCH matches every newly found motif against the subset of motifs. If the conservation drops below the user-defined threshold, the motif is discarded from the subset. The default conservation level is 95%.


7. Maximum number of the best motif sets to select for each motif
This option sets the number of sets created for each motif, from which the best one is chosen. 5 sets are used as the default.


Calculation of region consensus index

Identical motifs found in the same sequence are distinguished by their positions in the sequence. Since the biologically correct positions of these identical tuples cannot be identified solely from the information inside the motif, flanking bases must be taken into account. Each position set is used for an alignment employing the weight corrected alignment algorithm described in Frech et al. (1993). From each original set only the top scoring combination is kept. The region consensus index is obtained for all identical motifs in combination with all possible motifs in the other sequences in the dataset. Individual positions of each set are used as anchors for the alignment, and motif members that do not match the extended consensus (up to 25 nucleotides) of all other members are rejected. This results in a complete consensus description for the binding site, which contains the relative frequency gap at each position.

8. Length of region left of motif sets to select for each motif.
A left flanking region of the user-defined number of nucleotides is analyzed for similarity. The default region is 30nt.

9. Length of region right of motif sets to select for each motif.
A right flanking region of the user-defined number of nucleotides is analyzed for similarity. The default region is 30nt.

10. Maximum number of position sets per set for optimum calculations
This parameter defines the possible number of considered position sets for each motif set. As a default 20 position sets are assumed.


11. Minimum region conservation for a position set.
Conservation of a position set for each motif set. 97.5% conservation from the maximum region conservation of its tuple is assumed as a default value.

12. Minimum similarity of motif with consensus motif to be treated as a consensus.
This option defines the minimum similarity that a newly found motif must have with the consensus motif to be termed a consensus. The default is a motif element position similarity of 0.8.

13. Minimum similarity of motif with consensus core string to be included in a list of possible cores
All sequences are analyzed for motif elements whose positions are similar to the positions of the elements in the position set of the consensus motif, which is called the consensus core string. The default is an element position similarity of 0.7.