Wikiomics:Repeat finding

From OpenWetWare

Revision as of 13:16, 22 March 2010

To keep things simple, this page covers repeat finding in eukaryotic genomic DNA only.

Repeat finding can be divided into two tasks, depending on whether a repeat library is available:

A) A library exists for the given species (or possibly for a closely related one)

or

B) You construct such a library de novo.


Task A is usually a prerequisite step for genome annotation and even for BLAST searches. For newly sequenced genomes one should start with B (constructing a species-specific repeat library).



Detecting known repeats

Most commonly used: RepeatMasker

RepeatMasker


  • Online web server [1]
  • command line

You need a FASTA file (a multi-FASTA file is fine). Type:

repmask your_sequence_in_fasta_format

You will get a file named your_sequence_in_fasta_format.masked, which contains the masked sequence.

species options (choose only one):

-m(us) masks rodent specific and mammalian wide repeats
-rod(ent) same as -mus
-mam(mal) masks repeats found in non-primate, non-rodent mammals
-ar(abidopsis) masks repeats found in Arabidopsis
-dr(osophila) masks repeats found in Drosophilas
-el(egans) masks repeats found in C. elegans
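
A quick way to see how much of the sequence was masked is to count the N residues in the .masked file. The helper below is a minimal sketch, not part of RepeatMasker. The function name masked_fraction is made up for illustration, and it assumes default masking (repeats replaced by N), so any N already present in the input is counted too.

```shell
# Hypothetical helper: fraction of residues that are N/n in a FASTA file.
# Assumes RepeatMasker's default masking, where repeats are replaced by N.
masked_fraction() {
    awk '
        /^>/ { next }                          # skip FASTA header lines
        {
            total += length($0)                # all residues on this line
            line = $0
            masked += gsub(/[Nn]/, "", line)   # gsub returns the number of Ns removed
        }
        END {
            if (total == 0) { print "0.0000"; exit }
            printf "%.4f\n", masked / total
        }
    ' "$1"
}

# Usage (file name taken from the text above):
#   masked_fraction your_sequence_in_fasta_format.masked
```

A value of 0.5000 would mean that half of all residues are N.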

De novo repeat library construction

For program recommendations based on tests, see: Saha et al. Empirical comparison of ab initio repeat finding programs (2008)

For an extensive review listing tens of programs, see: Lerat E. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs (Nov 2009)

RepeatScout

command line only, requires compilation

Site: http://bix.ucsd.edu/repeatscout/

current version (2010-03): 1.05

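Documentation:

  • PPT presentation of the algorithm: http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt
  • publication (PDF): De novo identification of repeat families in large genomes (2005), http://bioinformatics.oxfordjournals.org/cgi/reprint/21/suppl_1/i351.pdf
  • prerequisites
      • Perl
      • Tandem Repeats Finder (trf), http://tandem.bu.edu/trf/trf.html (accessed 2010-03-22), last version: 4.04
      • nseg, ftp://ftp.ncbi.nih.gov/pub/seg/nseg/
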
Simplest run:

  • build frequency table
build_lmer_table -sequence input_genome_sequence.fas -freq output_lmer.frequency

The output_lmer.frequency file can still be quite large (1.7 GB for a 900 MB FASTA file).
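
To check disk space before starting, the single observation above (1.7 GB of table for 900 MB of FASTA) can be turned into a rough estimate. This is only a sketch: the function name estimate_lmer_table_bytes is hypothetical, and scaling linearly from one data point is an assumption that may not hold for other genomes.

```shell
# Hypothetical helper: rough size estimate (in bytes) of the l-mer frequency
# table, scaled linearly from the one observation on this page
# (1.7 GB table for a 900 MB FASTA file, i.e. a ratio of 17/9).
estimate_lmer_table_bytes() {
    size=$(wc -c < "$1")          # input FASTA size in bytes
    echo $(( size * 17 / 9 ))     # integer arithmetic, rounds down
}

# Usage:
#   estimate_lmer_table_bytes input_genome_sequence.fas
```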

  • create fasta file containing all kinds of repeats
RepeatScout -sequence input_genome_sequence.fas -output output_repeats.fas  -freq output_lmer.frequency

RAM usage (RepeatScout): more than 17 GB for an 800 Mb genomic sequence.

The output (output_repeats.fas) is a FASTA file with headers such as >R=1, >R=232 etc. It also contains trivial simple repeats (CACACA...) and tandem repeats.
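
To get a first impression of the raw library, it can help to list each repeat with its length. The sketch below uses plain awk and is not part of RepeatScout. The function name fasta_lengths is made up, and it assumes the repeat name is the first word of each header line.

```shell
# Hypothetical helper: print "name<TAB>length" for every record in a FASTA file.
# The name is taken as the first word of the header, without the leading ">".
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush the previous record
            name = substr($1, 2)
            len = 0
            next
        }
        { len += length($0) }                     # sequence may span several lines
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Usage:
#   fasta_lengths output_repeats.fas | sort -k2,2n | head
```

Sorting by the second column, as in the usage line, puts the shortest sequences first, which are the ones the next filtering step removes.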

  • filter out short (<50 bp) sequences and remove "anything that is over 50% low-complexity vis a vis TRF or NSEG". This is done by a Perl script, filter-stage-1.prl.

The script requires trf and nseg to be on the PATH; alternatively, set the environment variables TRF_COMMAND and NSEG_COMMAND to point to their locations.

filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered
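
Before running the filter, one can check that both helper programs are reachable. The following is a sketch under assumptions: the function names tool_available and check_filter_tools are made up, and it treats TRF_COMMAND and NSEG_COMMAND as plain paths to executables, which is how the description above reads.

```shell
# Hypothetical helper: succeed if a tool is reachable either via an explicit
# path (first argument, may be empty) or by name on the PATH (second argument).
tool_available() {
    if [ -n "$1" ] && [ -x "$1" ]; then
        return 0
    fi
    command -v "$2" >/dev/null 2>&1
}

# Check both prerequisites of filter-stage-1.prl and report what is missing.
check_filter_tools() {
    status=0
    tool_available "${TRF_COMMAND:-}" trf   || { echo "trf not found" >&2; status=1; }
    tool_available "${NSEG_COMMAND:-}" nseg || { echo "nseg not found" >&2; status=1; }
    return "$status"
}

# Usage (the paths are placeholders, adjust to your installation):
#   export TRF_COMMAND=/path/to/trf
#   export NSEG_COMMAND=/path/to/nseg
#   check_filter_tools && filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered
```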
  • run RepeatMasker on your genome of interest using filtered RepeatScout library
 RepeatMasker  input_genome_sequence.fas -lib output_repeats.fas.filtered

Output used for the next step: input_genome_sequence.fas.out

  • filter putative repeats by copy number. By default, only sequences occurring > 10 times in the genome are kept.
 cat output_repeats.fas.filtered | filter-stage-2.prl --cat=input_genome_sequence.fas.out

You can change the threshold with an option such as "--thresh=20" (only repeats occurring 20 or more times will be kept).
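
To inspect the copy numbers yourself, the hits per repeat can be tallied directly from the RepeatMasker .out file. This is a minimal sketch of the idea, not the code of filter-stage-2.prl. The function name repeat_copy_counts is hypothetical, it assumes the standard .out layout (data rows start with the numeric score, the matching repeat name is in column 10), and it keeps repeats with at least the given number of hits.

```shell
# Hypothetical helper: count hits per repeat in a RepeatMasker .out file and
# print "name<TAB>count" for repeats with at least <thresh> hits (default 10).
# Assumes the matching repeat name is the 10th whitespace-separated column.
repeat_copy_counts() {
    awk -v thresh="${2:-10}" '
        $1 ~ /^[0-9]+$/ { count[$10]++ }   # header and blank lines do not start with a number
        END {
            for (name in count)
                if (count[name] >= thresh) print name "\t" count[name]
        }
    ' "$1" | sort
}

# Usage:
#   repeat_copy_counts input_genome_sequence.fas.out 20
```

Comparing this list with the headers that survive filter-stage-2.prl is a simple sanity check on the chosen threshold.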
