Wikiomics:Repeat finding
From OpenWetWare
Revision as of 14:16, 22 March 2010
For simplicity, this page covers repeat finding in eukaryotic genomic DNA only.
Repeat finding can be divided into two tasks, depending on the availability of a repeat library:
A) A library exists for the given (or a closely related) species
or
B) you construct such a library de novo.
Task A is usually a prerequisite step for genome annotation and even for blast searches. For newly sequenced genomes one should start with B (constructing a species-specific repeat library).
Detecting known repeats
Most commonly used: RepeatMasker
RepeatMasker
- web site: http://www.repeatmasker.org/
- current version (checked on 2010-03-22): 3.2.8
- documentation: http://www.repeatmasker.org/webrepeatmaskerhelp.html
- Online web server [1]
- command line
You need a FASTA file (it may contain multiple sequences). Type:
RepeatMasker your_sequence_in_fasta_format
You will get a file named your_sequence_in_fasta_format.masked, which contains the masked sequence.
species options (choose only one):
-m(us) masks rodent specific and mammalian wide repeats
-rod(ent) same as -mus
-mam(mal) masks repeats found in non-primate, non-rodent mammals
-ar(abidopsis) masks repeats found in Arabidopsis
-dr(osophila) masks repeats found in Drosophilas
-el(egans) masks repeats found in C. elegans
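To get a quick idea of how much of the input was masked, you can count the N characters in the .masked file. This is a minimal sketch and not part of RepeatMasker: the function name masked_fraction is made up for illustration, and it assumes masking with N (the default). Ns already present in the input (e.g. assembly gaps) are counted as well.

```shell
# Hypothetical helper, not part of RepeatMasker.
# Prints the fraction of N/n characters among all sequence characters
# in a FASTA file (header lines starting with '>' are skipped).
masked_fraction() {
    awk '!/^>/ { total += length($0); masked += gsub(/[Nn]/, "") }
         END { if (total > 0) printf "%.4f\n", masked / total
               else print "0.0000" }' "$1"
}
```

Usage: masked_fraction your_sequence_in_fasta_format.masked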
De novo repeat library construction
For program recommendations based on tests, see: Saha et al. Empirical comparison of ab initio repeat finding programs (2008)
For an extensive review listing tens of programs, see: Lerat E. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs (Nov 2009)
RepeatScout
command line only, requires compilation
Site: http://bix.ucsd.edu/repeatscout/
current version (2010-03): 1.0.5
Documentation:
- http://bix.ucsd.edu/repeatscout/readme.1.0.5.txt
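- PPT presentation of the algorithm: http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt
- publication (PDF): De novo identification of repeat families in large genomes (2005), http://bioinformatics.oxfordjournals.org/cgi/reprint/21/suppl_1/i351.pdf
- prerequisites
- Perl
- Tandem Repeats Finder (trf), http://tandem.bu.edu/trf/trf.html (accessed 2010-03-22), latest version: 4.04
- nseg, ftp://ftp.ncbi.nih.gov/pub/seg/nseg/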
Simplest run:
- build frequency table
build_lmer_table -sequence input_genome_sequence.fas -freq output_lmer.frequency
The output_lmer.frequency file can still be quite large (1.7 GB for a 900 MB FASTA file).
- create fasta file containing all kinds of repeats
RepeatScout -sequence input_genome_sequence.fas -output output_repeats.fas -freq output_lmer.frequency
RAM usage (RepeatScout): > 17 GB for an 800 Mb genomic sequence.
The output (output_repeats.fas) is a FASTA file with headers such as >R=1, >R=232, etc. It also contains trivial simple repeats (CACACA...) and tandem repeats.
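Before filtering, it can help to look at how long the sequences in the RepeatScout output are. The following is a minimal sketch and not part of RepeatScout: the function name fasta_lengths is made up for illustration, and it simply sums the sequence line lengths under each header.

```shell
# Hypothetical helper, not part of RepeatScout.
# Prints "name<TAB>length" for every record of a FASTA file.
# The name is the first word of the header without the leading '>'.
fasta_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}
```

Usage: fasta_lengths output_repeats.fas | awk '$2 < 50' | wc -l
This counts the sequences shorter than 50 bp.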
- filter out short (<50 bp) sequences and remove "anything that is over 50% low-complexity vis a vis TRF or NSEG". This is done by a Perl script.
It requires trf and nseg to be on the PATH, or the environment variables TRF_COMMAND and NSEG_COMMAND to point to their locations.
filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered
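If the filter fails, first check that both helper programs can be found. This is a minimal sketch and not part of RepeatScout: the function name tool_status is made up for illustration, and checking the environment variable before the PATH is an assumption about the lookup order, not something taken from the script itself.

```shell
# Hypothetical helper, not part of RepeatScout.
# $1 = name of an environment variable (e.g. TRF_COMMAND)
# $2 = command name expected on the PATH (e.g. trf)
# Reports where the program was found; returns 1 if it was not found.
tool_status() {
    var_name=$1
    cmd_name=$2
    # Read the value of the variable whose name is stored in var_name.
    eval "var_value=\${$var_name:-}"
    if [ -n "$var_value" ] && [ -x "$var_value" ]; then
        echo "$cmd_name: using $var_name=$var_value"
    elif command -v "$cmd_name" >/dev/null 2>&1; then
        echo "$cmd_name: found on PATH"
    else
        echo "$cmd_name: NOT FOUND"
        return 1
    fi
}
```

Usage: tool_status TRF_COMMAND trf; tool_status NSEG_COMMAND nseg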
- run RepeatMasker on your genome of interest using the filtered RepeatScout library
RepeatMasker input_genome_sequence.fas -lib output_repeats.fas.filtered
Output used for the next step: input_genome_sequence.fas.out
- filter putative repeats by copy number. By default only sequences occurring > 10 times in the genome are kept
cat output_repeats.fas.filtered | filter-stage-2.prl --cat=input_genome_sequence.fas.out
You can change the threshold with e.g. "--thresh=20" (only repeats occurring 20+ times will be kept)
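To see what the copy number filter has to work with, you can count the hits per repeat in the RepeatMasker .out file yourself. This is a minimal sketch and not a reimplementation of filter-stage-2.prl: the function names are made up for illustration, every line of the .out file is counted as one copy, and the ">=" comparison is an assumption. The repeat name is taken from column 10 of the .out file.

```shell
# Hypothetical helpers, not part of RepeatScout or RepeatMasker.
# Prints "repeat_name<TAB>hit_count" for a RepeatMasker .out file.
# Data lines start with a numeric score; the header lines do not.
repeat_copy_counts() {
    awk '$1 ~ /^[0-9]+$/ { n[$10]++ }
         END { for (r in n) print r "\t" n[r] }' "$1" | LC_ALL=C sort
}

# Prints the names of repeats with at least $2 hits in the .out file $1.
repeats_with_min_copies() {
    repeat_copy_counts "$1" | awk -v min="$2" '$2 >= min { print $1 }'
}
```

Usage: repeats_with_min_copies input_genome_sequence.fas.out 20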


