Wikiomics:Sequence motifs: Difference between revisions

Latest revision as of 14:59, 7 June 2010

Sequence motifs - for more general background see Wikipedia's Sequence_motif article.

Sequence motif finding is often used in the context of transcription factor binding sites. Before starting, it is important to keep a few things in mind:

transcriptional regulation differs in complexity from simple organisms (e.g. yeast) to human. What works in yeast (multiple algorithms on promoters of co-expressed genes) does not work on human data

for any given gene, start with comparative genomics (usually known as phylogenetic footprinting, called here phylogenetic fingerprinting). For 60-70% of promoters you can find common TF binding sites between human and mouse

using more species gives stricter motif detection but lower sensitivity (a gene can be regulated differently in some species, a TF binding site can lie outside the fragment used for prediction, etc.)

conservation of promoters varies between genes. For some highly conserved ones (e.g. some SOX genes) there is too little difference between human and mouse to detect anything. You have to go down the tree to fish species to find islands of conservation.

for some more human-specific genes there may be no human-mouse conservation, or the orthologous gene may not even exist in one of the species. Use sequences from more closely related species. This extends to studying multiple sequences from very closely related species (e.g. the 12 Drosophila species). Search for the term "phylogenetic shadowing"

not all collections of sequences are equally good for sequence motif prediction. Order of preference:
- chromatin immunoprecipitation followed by next generation sequencing
- promoters of orthologues (see above)
- co-expressed genes

pointless: searching for highly degenerate motifs in long stretches of DNA without any statistics (this yields hundreds of sites of almost no value).
- only by comparing the scores and frequencies of a given motif in a large set of sequences can one estimate whether a TF site detected in the promoter of interest is of any significance. See PSCAN and "Prediction of over Represented Transcription Factor Binding Sites in Co-regulated Genes Using Whole Genome Matching Statistics" (Pavesi 2007)
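The idea of judging a site by its frequency in a large reference set can be sketched with a hypergeometric test. This is only an illustration and is not the statistic used by PSCAN; the function names, the consensus-with-mismatches matcher and the assumption that the target set is a subset of the background set are all made up for this example.

```python
from math import comb

def has_site(seq, consensus, max_mismatch=0):
    """Return True if seq contains consensus (forward strand only) with
    at most max_mismatch mismatches. Hypothetical, simplistic matcher."""
    seq, consensus = seq.upper(), consensus.upper()
    width = len(consensus)
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        mismatches = sum(1 for a, b in zip(window, consensus) if a != b)
        if mismatches <= max_mismatch:
            return True
    return False

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n sequences are drawn without replacement from N
    sequences, K of which carry the site."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def enrichment(target, background, consensus, max_mismatch=0):
    """Count sequences with a site in target and background, and return
    (k, K, p-value). Assumes target is a subset of background."""
    k = sum(has_site(s, consensus, max_mismatch) for s in target)
    K = sum(has_site(s, consensus, max_mismatch) for s in background)
    return k, K, hypergeom_pvalue(k, len(target), K, len(background))
```

A small p-value means the motif is found in the target promoters more often than expected from its frequency in the background set; a motif present in most background promoters will not score well however many target hits it has.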

It is important to use a few complementary programs and to allow each program to predict several motifs. There is a threshold in the number of sequences used above which sensitivity no longer improves.

Motif finding programs

There are obvious trade-offs between speed and accuracy/sensitivity when running these programs. A few rules based on the review by Hu et al.:

avoid using longer sequences than necessary. Doing so increases noise and running time.
for programs like MEME there is a plateau of motifs found after about 10 input sequences, so one can randomly select 10 sequences out of a larger set and then check for occurrence of detected motifs in the remaining ones.

scores between different programs are incompatible
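The subsampling rule above (pick 10 sequences for motif discovery, then check the remaining ones) can be sketched as follows. The helper names, the fixed random seed and the exact-string check for a motif are assumptions for illustration; a real check would scan with the matrix reported by the motif finder.

```python
import random

def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def split_for_discovery(records, n_discovery=10, seed=0):
    """Randomly pick n_discovery records for motif finding.
    Returns (picked, rest), both in the original input order."""
    rng = random.Random(seed)
    indices = list(range(len(records)))
    rng.shuffle(indices)
    chosen = set(indices[:n_discovery])
    picked = [r for i, r in enumerate(records) if i in chosen]
    rest = [r for i, r in enumerate(records) if i not in chosen]
    return picked, rest

def fraction_with_motif(records, consensus):
    """Fraction of records containing the exact consensus string
    (forward strand only)."""
    if not records:
        return 0.0
    hits = sum(1 for _, seq in records if consensus.upper() in seq.upper())
    return hits / len(records)
```

The picked records would be written to a FASTA file and given to the motif finder; fraction_with_motif on the rest then gives a rough idea of whether a reported motif generalises beyond the 10 sequences it was found in.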

Tested set

This section is a few years old and requires retesting.

AlignACE

Gibbs sampling algorithm (stochastic), requires multiple runs to get all top hits.

Web server: [1] Gibbs sampling algorithm

command line C++, requires compilation

to run it:

./AlignACE -i GAL.seq > test.ace

Accessory programs for motif comparisons and motif finding

CompareACE [2] Compares a set of found motifs to a database of TFs (in yeast on the web)

tutorial@Harvard

MEME

Web: [3]

Command line

meme -nmotifs numberof_motifs_2_find -mod oops  -protein -sequence your_fasta_file -outfile your_fasta_file.meme

Where the model given with -mod can be oops, zoops or anr:

oops: One Occurrence Per Sequence

zoops: Zero or One Occurrence Per Sequence

anr: Any Number of Repetitions
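To make the three models concrete, the following sketch takes the number of motif sites found in each input sequence and lists the models those counts are consistent with. The function name is hypothetical and this is not part of MEME; it only restates the definitions above as code.

```python
def compatible_models(site_counts):
    """Given the number of motif sites in each input sequence, return the
    occurrence models (oops, zoops, anr) the counts are consistent with."""
    models = []
    if all(count == 1 for count in site_counts):
        models.append("oops")   # exactly one site in every sequence
    if all(count <= 1 for count in site_counts):
        models.append("zoops")  # zero or one site in every sequence
    models.append("anr")        # any number of sites, including none
    return models
```

For example, counts of 1, 0 and 1 across three promoters rule out oops but fit zoops and anr, while a promoter with two sites fits anr only.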

Weeder

./weederlauncher.out your_promoters_file.fasta  MM large A M S t15

Where MM stands for Mus musculus, HS for Homo sapiens, etc. See the ./FreqFiles/ directory for more.

The output files: your_promoters_file.mix, your_promoters_file.wee (output as text), your_promoters_file.html (output summary in HTML)

Bioprospector

Web server: [4]

accepts only FASTA files with each sequence on a single line

>sequence1 name
GGTGACGAC sequence1 as ONE LINE
>sequence2 name
GTAGCCTCATG sequence2 as ONE LINE

fixed width of motifs (default 10, range: 4-50)
binaries for Linux, Sun and Cygwin.
create a background file:

./genomebg.linux -i your_background -o your_background.genomeBG

run BioProspector

./BioProspector.linux -n 200 -d 1 -r 30 -i target_fasta_seqs -f your_background.genomeBG -o outputbiop1
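Most FASTA files wrap sequences over several lines, so they need converting to the single-line format described above before BioProspector (or MDScan, which has the same restriction) will accept them. A minimal sketch of such a conversion; the function name is hypothetical.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.
    Header lines are passed through unchanged; blank lines are dropped."""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
                chunks = []
            yield line
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)
```

Since an open file iterates over its lines, unwrap_fasta can be given a file handle directly and its output written back out with one line per item.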

Improbizer

Web. Takes into account position of the motifs. Command line utility called ameme

 ./ameme good=your_test_set.fasta bad=set_of_random_promoter_sequences.fasta numMotifs=10 motifOutput=ameme.out outputLogo background=m2

NestedMica

(java + C program, requires compilation ) [5]

Calculate background:

makemosaicbg -seqs your_input_sequences.fasta -mosaicClasses 1 -mosaicOrder 1 -out your_sequences_background.sbg

Calculate motifs:

motiffinder -seqs your_input_sequences.fasta -backgroundModel your_sequences_background.sbg -numMotifs 2

View the motifs:

motifviewer motifs.xms

Output in xms (an XML variant) format.
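The background step above is run with -mosaicClasses 1 -mosaicOrder 1. Assuming this means a single-class, first-order Markov background (each base depends only on the previous one), the quantity being estimated can be sketched as below. This is not NestedMICA code and does not read or write the .sbg format; the function name and the pseudocount are assumptions.

```python
def first_order_background(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(next base | previous base) from a list of DNA strings.
    Returns a nested dict: probs[previous][next]."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            # Skip pairs containing N or other symbols outside the alphabet.
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: count / total for nxt, count in row.items()}
    return probs
```

A motif finder scores candidate sites against such a model, so a background estimated from sequences unlike the input (different GC content, repeats) will distort which motifs look significant.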

MotifExplorer (Java motif viewer compatible with NestedMica): [6]

as of Jan 22nd 2007 works on MacOS. Problems on Windows and Ubuntu Linux.

SOMBRERO

Self Organising Maps [7]

create background model (takes >15 min on a 35 Mbyte fasta file/2.26GHz Pentium):

perl ./BackExtract.pl -seq your_background_sequences.fasta

Create SOM with motif lengths of 8, 10, 12, 14 and 16 nucleotides:

./SOMBRERO -t your_target_sequences.fasta -b out.back -lm 8 16 -out target_sombrero.out

View the SOM (requires installation of Tkperl)

perl ./SOMBREROView.pl  target_sombrero.ou

MDScan

uses restricted fasta format!

>sequence1 name
ATGGTGACGAC sequence1 as ONE LINE
>sequence2 name
GTAGCCTCATG sequence2 as ONE LINE

requires sequence background file

./genomebg -i inputSequenceFile -o outputDistributionFile

running it:

./MDscan -i inseq -w 15 -f yeast_all.bg -t 10 -c 80 -r 10 -n 5 -g 1

find motif of width 15 from sequence file inseq, use yeast_all.bg as the background distribution. Find candidate motifs from top 10 sequences, and refine 5 iterations from the top 0 sequences. Report the final best 5 motifs to stdout, and do not print out progress messages on the way.

YMF (not working)

C++ program, requires compilation
running it:

./stats stats.config 800 6 ../ymftables/yeast -sort ../examples/abf/abf1 ../examples/abf/cha1

segfaults.

DIPS

Obtainable from author. Requires compilation

example (~50 kbp takes 70 min)

dmotif -positive positive.fna -negative negative.fna -len 9 -bkg fly_background.fna -niter 5 -nmotif 1 > dmotifoutput 2> dmotiflog

Change '-nmotif 1' to '-nmotif 10' if you want to get the top ten motifs.

Novel algorithms

Cosmo (web, R package, and standalone C program) http://cosmoweb.berkeley.edu/ PDF

more sensitive than MEME according to authors

The AMADEUS Motif Discovery Tool (whole platform)

http://acgt.cs.tau.ac.il/amadeus/

GAME (java, genetic algorithm) [8] (authors page)

paper http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/13/1577

MotifCut (maximum density subgraphs) [9]

Paper: http://bioinformatics.oxfordjournals.org/cgi/reprint/22/14/e150

motif length: fixed, between 6 and 31

GibbsST [10] Not working yet

Paper(PDF): [11]

THEME (ChIP-chip only??)

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/4/423

Ortholog sequences

PhyME [12]

Paper: [13]

PhyloGibbs [14]

Paper: [15]

PhyloCon [16]

Paper: [17]

Multiple algorithms / metaservers

MotifVoter PDF 11! algorithms from National University of Singapore. Web server, results by email.

Credo [18] (broken link)

Visualisation of AlignACE, DIALIGN, FootPrinter, MEME and MotifSampler results. Paper: [19]

BEST

Download: [20] (Qt on linux) BEST includes four commonly used motif-finding programs: AlignACE, BioProspector, CONSENSUS and MEME, as well as the optimization program BioOptimizer. Paper: [21]

Multifinder http://the_brain.bwh.harvard.edu/MultiFinderSuppl/ (download)
RgS-Miner (web, as of 2007.05.22: uses gene list but not sequences yet) http://rgsminer.csie.ncu.edu.tw/

Combining/comparing motifs

MatCompare
- set of user provided PWMs: [22]
- single PWM against TRANSFAC or JASPAR [23]
- simple correlation value between each pair of motifs
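As an illustration of what a correlation value between two motifs can mean, the sketch below computes the Pearson correlation between two matrices of equal width. It is not MatCompare's actual procedure, which may differ; the function name and the column layout (A, C, G, T order) are assumptions, and the sketch ignores offsets between motifs and reverse complements.

```python
from math import sqrt

def pwm_correlation(pwm_a, pwm_b):
    """Pearson correlation between two equal-width PWMs. Each PWM is a
    list of columns; each column is a list of 4 values (A, C, G, T)."""
    if len(pwm_a) != len(pwm_b):
        raise ValueError("PWMs must have the same width")
    x = [value for column in pwm_a for value in column]
    y = [value for column in pwm_b for value in column]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant matrix")
    return cov / sqrt(var_x * var_y)
```

Identical matrices give 1.0; matrices that prefer different bases at every position give a negative value. Comparing motifs of different widths would additionally need every alignment offset to be tried.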

CompareACE [24] supporting program for AlignACE

STAMP [25]

Handles 12 different output formats from a wide range of motif finding programs. Compares these motifs to known TFBS from JASPAR and other databases of motifs, including user-defined ones. Converts the output to an intermediate format accessible on the server under "X motifs loaded".

APIs

Cistematic http://cistematic.caltech.edu/

Python package with interfaces to e.g. MEME, AlignACE, Co-Bind, and FootPrinter. Paper [26]

TAMO http://fraenkel.mit.edu/TAMO/

Python/C++ package with interfaces to MDscan, AlignACE, and MEME. Paper: [27]

Databases

ORegAnno database: a dynamic collection of literature-curated regulatory regions, transcription factor binding sites and regulatory mutations (polymorphisms and haplotypes).

Paper: [28]

To Read

Tompa review of 14 programs Nat Biotech. 2005 [29]
Maximilian Haeussler's master thesis and his Tregwiki @Openwetware
Comparison of several programs: Hu NAR 2005 [30]
Erich Schwarz's list from 2002
Motif Tool Assessment Platform (MTAP) wiki from Omaha
Giulio Pavesi and Federico Zambelli, “Prediction of over Represented Transcription Factor Binding Sites in Co-regulated Genes Using Whole Genome Matching Statistics,” in Applications of Fuzzy Sets Theory, 2007, 651-658, http://dx.doi.org/10.1007/978-3-540-73400-0_83.

External courses/tutorials

EMBL 2001 [31]
Harvard (AlignACE) [32]

stuff to incorporate

FOOTER and FOOTER paper

Melina II Metaserver (uses four out of five programs: CONSENSUS, MEME, Gibbs Sampler, MDScan and Weeder) by Kenta Nakai @Tokyo University. Paper

Credits


Darek Kedra wrote this tutorial


@@ Line 1: / Line 1: @@
 '''Sequence motifs''' - for more general background see Wikipedias [http://en.wikipedia.org/wiki/Sequence_motif Sequence_motif].
+Sequence motif finding is used often in contexts of transcription binding sites.
+Before starting it is important to keep few things in mind:
+* transcriptional regulation differs in complexity from simple organisms (i.e. yeast) to human. What works in yeast (multiple algorithms on promoters of co-expressed genes) does not work on human data
+* for any given gene, start with comparative genomics (called here phylogenetic fingerprinting). For 60-70% of promoters you can find common TF binding sites between human and mouse
+* more species, stricter motif detection, but less sensitivity (gene can be regulated differently in some species, TF binding site outside of fragment used for prediction etc.)
+* conservation of promoters varies between genes. For some highly conserved ones (i.e. some SOX genes) there is too little of a difference between human and mouse to detect anything. You have to go down the tree to fish species to find island of conservation.
+* for some more human specific genes there may be no conservation human-mouse, or the orthologue gene may not even exist in one of the species. Use sequences from more closely related species. This goes up to studying multiple sequences from very closely related species (i.e. 12 Drosophila's). Search the term "phylogenetic shadowing"
+* not all collection of sequences are equally good for sequence motif prediction. Order of preference:
+** chromatin immunoprecipitation followed by next generation sequencing
+** promoters of ortologues (see above)
+** co-expressed genes
+* pointless: searching for highly degenerate motifs in long stretches of DNA without any statistics (hundreds of sites of almost no value).
+** only by comparing scores and frequencies of a given motif in a large set of sequences one can estimate if TF site detected in promoter of interest is of any significance. See [http://159.149.109.9/pscan/ PSCAN] and "Prediction of over Represented Transcription Factor Binding Sites in Co-regulated Genes Using Whole Genome Matching Statistics" Pavesi 2007
@@ Line 7: / Line 30: @@
 ==Motif finding programs==
 There are obvious trade offs between speed and accuracy/sensitivity when running these programs. Few rules based on review by Hu et al.:
-* avoid using longer sequencess than necessary. It does increase noise signal and increases running time.
+* avoid using longer sequences than necessary. It does increase noise signal and increases running time.
-* for programs like MEME there is a plateu of motifs found after about 10 input sequences, so one can randomly select 10 sequences out of a larger set and
+* for programs like MEME there is a plateau of motifs found after about 10 input sequences, so one can randomly select 10 sequences out of a larger set and
-then check for occurence of detected motifs in the remaining ones.
+then check for occurrence of detected motifs in the remaining ones.
 * scores between different programs are incompatible
 ==Tested set==
+This section is few years old and requires retesting.
 ===AlignACE===
 Gibbs sampling algorithm (stochastic), requires multiple runs to get all top hits.
@@ Line 185: / Line 210: @@
 * The AMADEUS Motif Discovery Tool (whole platform)
-http://www.cs.tau.ac.il/~rshamir/amadeus/Amadeus.htm
+http://acgt.cs.tau.ac.il/amadeus/
-* GAME (java, genetic algorithm) [http://mail.med.upenn.edu/~zhiwei/GAME/]
+* GAME (java, genetic algorithm) [http://www-stat.wharton.upenn.edu/~stjensen/doc.html] (authors page)
 paper http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/13/1577
@@ Line 211: / Line 236: @@
 ==Multiple algorithms / metaservers==
-* [http://www.comp.nus.edu.sg/~bioinfo/MotifVoter/ MotifVoter] [http://precedings.nature.com/documents/1251/version/1/files/npre20071251-1.pdf PDF] 10! algorithms from National University of Singapore. Web server, result by email.
+* [http://compbio.ddns.comp.nus.edu.sg/~edward/MotifVoter2/  MotifVoter] [http://precedings.nature.com/documents/1251/version/1/files/npre20071251-1.pdf PDF] 11! algorithms from National University of Singapore. Web server, result by email.
-* Credo [http://mips.gsf.de/proj/regulomips/credo.htm]
+* Credo [http://mips.gsf.de/proj/regulomips/credo.htm] (broken link)
 Visualisation of AlignACE, DIALIGN, FootPrinter, MEME and MotifSampler results.
 Paper: [http://bioinformatics.oxfordjournals.org/cgi/content/full/21/23/4304]
@@ Line 258: / Line 283: @@
 * [http://bioinformatics.caltech.edu/cis-analysis.txt Erich Schwarz's list from 2002]
 * [http://biobase.ist.unomaha.edu/mediawiki/index.php/Main_Page Motif Tool Assessment Platform (MTAP)] wiki from Omaha
+* Giulio Pavesi and Federico Zambelli, “Prediction of over Represented Transcription Factor Binding Sites in Co-regulated Genes Using Whole Genome Matching Statistics,” in Applications of Fuzzy Sets Theory, 2007, 651-658, http://dx.doi.org/10.1007/978-3-540-73400-0_83.
 See also: http://www.connotea.org/user/darked89/tag/motif