Wikiomics:Repeat finding

From OpenWetWare

(Difference between revisions)
Jump to: navigation, search
(De novo repeat library construction)
Current revision (08:06, 2 December 2010) (view source)
(References)
 
(28 intermediate revisions not shown.)
Line 1: Line 1:
-
To simplify, this page assumes eucakariotic genomic DNA repeat finding.   
+
{{stub}}
 +
 
 +
 
 +
To simplify, this page assumes eukaryotic genomic DNA repeat finding.   
Repeat finding can be divided into two tasks, depending on availability of repeat library:
Repeat finding can be divided into two tasks, depending on availability of repeat library:
Line 12: Line 15:
Task A is usually a prerequisite step for genome annotation and even blast searches. For newly sequences genomes one should start with B (constructing species specific repeat library).
Task A is usually a prerequisite step for genome annotation and even blast searches. For newly sequences genomes one should start with B (constructing species specific repeat library).
 +
For more comprehensive list of programs read [http://bergmanlab.smith.man.ac.uk/?page_id=295 TE tools@Bergman Lab, U. of Manchester, UK]
=Detecting known repeats=
=Detecting known repeats=
 +
Most comonly used: Repeatmasker
 +
 +
==RepeatMasker==
 +
 +
* web site: http://www.repeatmasker.org/
 +
* current version (checked on 2010-03.22): 3.2.8
 +
* documentation: http://www.repeatmasker.org/webrepeatmaskerhelp.html
 +
 +
 +
* Online web server [http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker]
 +
 +
* command line   
 +
 +
You have to have a FastA file (it can be multiple FastA). Type:
 +
 +
<pre>
 +
repmask your_sequence_in_fasta_format
 +
</pre>
 +
 +
You will get a file: your_sequence_in_fasta_format.masked --- name tells all
 +
 +
species options (choose only one):
 +
<pre>
 +
-m(us) masks rodent specific and mammalian wide repeats
 +
-rod(ent) same as -mus
 +
-mam(mal) masks repeats found in non-primate, non-rodent mammals
 +
-ar(abidopsis) masks repeats found in Arabidopsis
 +
-dr(osophila) masks repeats found in Drosophilas
 +
-el(egans) masks repeats found in C. elegans
 +
</pre>
 +
 +
 +
==TransposonPSI==
 +
 +
* web site: http://transposonpsi.sourceforge.net/
 +
* latest version: 08-19-2010
 +
"identifies protein or nucleic acid sequence homology to proteins encoded by diverse families of transposable elements using PSI-blast"
 +
 +
Rather slow in sequential mode for larger genomes. Speedups easily obtainable by splitting the target genome/proteome and running it on multiple nodes and/or modifying  transposonPSI.pl script by adding "-a number_of_cores" (number_of_cores -> number, i.e. 4) to two lines executing blast searches. GFF3 output.
=De novo repeat library construction=
=De novo repeat library construction=
-
For review see: Saha et al. [http://nar.oxfordjournals.org/cgi/content/full/gkn064v1 Empirical comparison of ab initio repeat finding programs] (2008)
+
For programs recommendations based on test see: Saha et al. [http://nar.oxfordjournals.org/cgi/content/full/gkn064v1 Empirical comparison of ab initio repeat finding programs] (2008)
-
==RepeatScout==
+
For an extensive reviews listing tens of programs: 
 +
* Bergamn C. [http://bib.oxfordjournals.org/cgi/content/full/8/6/382 Discovering and detecting transposable elements in genome sequences]  (2007)
 +
* Lerat E. [http://www.nature.com/hdy/journal/vaop/ncurrent/full/hdy2009165a.html Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs ]  (Nov 2009)
 +
 
 +
Keep in mind that resulting libraries should be further screened for gene families. There are border cases, where genome may contain thousands of modified copies of a gene, ranging from seemingly functional copies, through pseudogenes, gene fragments and single exons (i.e Speer family in rodents).
 +
 
 +
==Consensus Based==
 +
One has to have at least a draft of the genome or multiple genomic sequences.
 +
 
 +
<!-- Since at least some next gen sequence assemblers (Newbler for 454 data) reject highly over-represented sequences during assembly, the true repeat content of the genome will be biased.
 +
-->
 +
===RepeatScout===
command line only, requires compilation  
command line only, requires compilation  
Line 29: Line 83:
* PPT presentation presenting algorithm: http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt
* PPT presentation presenting algorithm: http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt
* publication (PDF)[http://bioinformatics.oxfordjournals.org/cgi/reprint/21/suppl_1/i351.pdf De novo identification of repeat families in large genomes] 2005  
* publication (PDF)[http://bioinformatics.oxfordjournals.org/cgi/reprint/21/suppl_1/i351.pdf De novo identification of repeat families in large genomes] 2005  
 +
* prerequisites
 +
** Perl
 +
** [http://tandem.bu.edu/trf/trf.html Tandem Repeats Finder]  (trg) (accessed 2010-03-22), last version: 4.04
 +
** [ftp://ftp.ncbi.nih.gov/pub/seg/nseg/ nseg] 
 +
Simplest run:
Simplest run:
 +
* build frequency table
<pre>
<pre>
-
build_lmer_table -sequence input_sequence.fas -freq output_lmer.frequency
+
build_lmer_table -sequence input_genome_sequence.fas -freq output_lmer.frequency
-
RepeatScout -sequence input_sequence.fas -output output_repeats -freq  output_lmer.frequency
+
</pre>
</pre>
 +
 +
output_lmer.frequency file can be still quite large (1.7Gb for 900Mb fasta file)
 +
 +
* create fasta file containing all kinds of repeats
 +
<pre>
 +
RepeatScout -sequence input_genome_sequence.fas -output output_repeats.fas  -freq output_lmer.frequency
 +
</pre>
 +
 +
Resources:
 +
** RAM usage (RepeatScout): > 17Gb for 800Mb genomic sequence.
 +
** 9.6h  Xeon E7450  @ 2.40GHz
 +
 +
The output (output_repeats.fas) is a fasta file with headers (>R=1, >R=232 etc.). It contains also trivial simple repeats (CACACA...), tandem repeats
 +
 +
* filter out short (<50bp) sequences. Remove  "anything that is over 50% low-complexity vis a vis TRF or NSEG.". Perl script.
 +
It does require trg and nseg to be on the PATH, or setting env variables TRF_COMMAND and NSEG_COMMAND pointing to their location
 +
 +
<pre>
 +
filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered_1
 +
</pre>
 +
 +
this prints tons of messages
 +
 +
<!--
 +
cat output_repeats.fa | filter-stage-1.prl > output_repeats.fas.filtered_1
 +
 +
filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered_1 > fltr1_messages.txt
 +
this one works as well, prints tons of messages
 +
-->
 +
 +
 +
* run RepeatMasker on your genome of interest  using filtered RepeatScout library
 +
 +
<pre>
 +
RepeatMasker  input_genome_sequence.fas -lib output_repeats.fas.filtered_1
 +
</pre>
 +
 +
This is a very long step (36h for 800Mb draft genome) when run in such default mode. See discussion for this page for possible, but so far untested speedups.
 +
 +
Output used for the next step:  input_genome_sequence.fas.out
 +
 +
* filtering putative repeats by copy number. By default only sequences occurring > 10 times in the genome are kept
 +
 +
 +
<pre>
 +
cat output_repeats.fas.filtered_1  | filter-stage-2.prl --cat=input_genome_sequence.fas.out > output_repeats.fas.filtered_2
 +
</pre>
 +
 +
Fast (< 1min ). You can modify the filter using i.e. "--thresh=20" (only repeats occurring 20+ times will be kept)
 +
 +
 +
===Piler (with lastz)===
 +
====Prerequisites:====
 +
* lastz from http://www.bx.psu.edu/miller_lab/ (tested version from 2010-Jan-12)
 +
* piler from http://www.drive5.com/piler/ (tested version 1.0)
 +
 +
Also read the manual of pals, in case you need to understand the piler's gff input format: http://www.drive5.com/pals/pals_userguide.htm
 +
Both programs install easily on Ubuntu 9.10.
 +
 +
====Input====
 +
* fasta file. Fasta identifiers should not contain spaces/special characters (stay with alpha-numeric ones)
 +
* things to keep in mind:
 +
** default fasta output of i.e. Newbler needs to be fixed
 +
 +
====Running lastz====
 +
<pre>
 +
lastz my_fasta_input.fa[multiple] my_fasta_input.fa --output=my_fasta_lastz_output.csv \
 +
--format=general:name2,zstart2,end2,score,strand2,name1,zstart1,end1,identity  --notrivial --ambiguousn  --markend
 +
</pre>
 +
 +
====Converting lastz output to gff piler input====
 +
Python script (still in the testing phase)
 +
 +
====Running piler====
 +
Draft!
 +
See: http://www.drive5.com/piler/piler_userguide.html
 +
 +
<pre>
 +
piler -trs my_fasta_lastz_output.gff -out  my_fasta_lastz_output.trs
 +
</pre>
 +
 +
==Input Reads==
 +
This is pre-assembly repeat finding method.
 +
 +
===ReAS (broken install)===
 +
[http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.0010043 Paper]
 +
====Installation====
 +
* get ftp://ftp.genomics.org.cn/pub/ReAS/software/ReAS_2.02.tar.gz
 +
* unpack it in a suitable directory
 +
<pre>
 +
tar xfvz ReAS_2.02.tar.gz; cd ReAS_2.02/code
 +
</pre>
 +
 +
* fix two files:
 +
open code/N_matchreads.cpp and add line below i.e. after "#include<time.h>":
 +
<pre>
 +
#include <cmath>
 +
</pre>
 +
 +
perltidy finds one error in Clustering.pl. Replace:
 +
<pre>
 +
my $cluster_dir "cluster_"; #prefix of directory names
 +
</pre>
 +
 +
with:
 +
 +
<pre>
 +
my $cluster_dir = "cluster_"; #prefix of directory names
 +
</pre>
 +
 +
 +
* compile ReAS
 +
<pre>
 +
make; make install
 +
</pre>
 +
You will get binaries + perl modules in ReAS_2.02/bin
 +
* Put them on $PATH (bash)
 +
<pre>
 +
export PATH=/your/path/to/ReAS_2.02/bin/:$PATH
 +
</pre>
 +
 +
* you need to have on your PATH
 +
** dust from wublast, unknown version (not tested yet with dustmasker from blast) 
 +
** muscle (3.6 was used originally)
 +
** cross_match.manyreads from phrap
 +
 +
With a large number of reads and "-pa 1" (see below) cross_match (version: 0.990329) crashes with: "FATAL ERROR: seq_area_size".
 +
Untested:
 +
* the newest beta version (1.090518) do not have this memory limit, but also do not have cross_match.manyreads(?)
 +
* in the older version of cross_match one can change #define BLOCK_SIZE 10000 in db.c and recompile
 +
 +
* I decreased the input size for testing (265Mb fasta with 592k 454 reads) which is small enough. Possibly running ReAS with
 +
<pre>
 +
reas_all.pl -read input.fasta -n 8
 +
</pre>
 +
 +
may fix it.
 +
 +
====Usage====
 +
 +
read 00readme located in ReAS_2.02/code for more detailed instructions.
 +
 +
To run  with default settings with already selected set of reads:
 +
<pre>
 +
nohup reas_all.pl -read 454_subset_4repeat_search.fas -output 454_subset_4repeat_search.fas.cons_reas  2 > nohup.reas1.txt &
 +
</pre>
 +
 +
CAVEAT: at the current stage above command does not run all scripts correctly.
 +
 +
 +
====external info====
 +
* There is description how ReAS was used in Hydra magnipapillata genome project (see  supplementary information PDF, section "S9. Analysis of repeated sequences"):
 +
 +
http://www.nature.com/nature/journal/v464/n7288/full/nature08830.html
 +
 +
From 10M+ Sanger sequences of average length 800+ bp they constructed two libraries:
 +
* A) 1% of reads, 17-mers with a depth of at least 10 ("especially enriched in the CR1 class of retrotransposons")
 +
* B) 10% of reads 17-mers depth range of 10 to 100. 
 +
 +
"""
 +
- to improve the quality of the assembly, only reads that have at least 100 high-depth 17-mers were considered
 +
 +
- ReAS was run on each of the libraries separately
 +
 +
- after retaining only the assembled repeats of length larger than 500 nucleotides and a minimal average depth value (as provided by the
 +
program) of 10, the two libraries contained 949 and 25,110 repeats each, respectively.
 +
 +
- These sequences were then pulled together.
 +
 +
- initial ReAS assembly appears to be fragmented and there are many redundant sequences, the final version of the library was produced by
 +
running ReAS' join_fragments.pl and rmRedundance.pl scripts.
 +
 +
- final library contained 3909 reconstructed repetitive elements with an average length of about 1500 nt.
 +
"""
 +
 +
* see this thread @Biostar:
 +
http://biostar.stackexchange.com/questions/753/instaling-reas-repeat-finding
 +
 +
 +
 +
For pages on similar topics visit: [http://openwetware.org/wiki/Wikiomics Wikiomics@OpenWetWare]
 +
 +
=De novo masking without library=
 +
 +
These programs can mask sequence using only sequence itself. 
 +
 +
==Tandem Repeats Finder==
 +
web: http://tandem.bu.edu/trf/trf.html
 +
current version: 4.04 
 +
 +
Options recommended for genome 2 genome alignments in: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2829014/
 +
 +
match = 2 mismatch = 5 delta = 5 PM = 80 PI = 10 minscore = 30 maxperiod = 200 -R
 +
 +
==DustMasker==
 +
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/dustmasker/
 +
 +
==WindowMasker==
 +
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/winmasker/
 +
 +
==Others==
 +
* seg from wu-blast
 +
 +
=ribosomal RNA detection=
 +
While not "repetitive" in a strict sense, ribosomal sequences need to be detected and preferably masked for i.e. de novo gene prediction or ESTs mapping. There are in Genbank "protein coding genes" predicted from ribosomal sequences, not only as a result of de novo prediction but "prediction with a strong ESTs/RNA-Seq support".
 +
 +
==RNAmmer==
 +
http://www.cbs.dtu.dk/services/RNAmmer/
 +
 +
Wildly used.
 +
 +
==Meta-RNA==
 +
http://tools.camera.calit2.net/camera/meta_rna/
 +
 +
 +
 +
=References=
 +
* Church DM,  Goodstadt L,  Hillier LW,  Zody MC,  Goldstein S,  et al. 2009 Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse. PLoS Biol 7(5): e1000112. doi:10.1371/journal.pbio.1000112
 +
 +
[[Category:Protocol]] [[Category:In silico]] [[Category:Data analysis]]
 +
[[Category:Protocol]] [[Category:In silico]] [[Category:Data analysis]]

Current revision


To simplify, this page assumes eukaryotic genomic DNA repeat finding.

Repeat finding can be divided into two tasks, depending on availability of repeat library:

A) Library exists for a given (or possibly closely related species)

or

B) you construct such library de novo.


Task A is usually a prerequisite step for genome annotation and even blast searches. For newly sequences genomes one should start with B (constructing species specific repeat library).

For more comprehensive list of programs read TE tools@Bergman Lab, U. of Manchester, UK

Contents

Detecting known repeats

Most comonly used: Repeatmasker

RepeatMasker


  • Online web server [1]
  • command line

You have to have a FastA file (it can be multiple FastA). Type:

repmask your_sequence_in_fasta_format

You will get a file: your_sequence_in_fasta_format.masked --- name tells all

species options (choose only one):

-m(us) masks rodent specific and mammalian wide repeats
-rod(ent) same as -mus
-mam(mal) masks repeats found in non-primate, non-rodent mammals
-ar(abidopsis) masks repeats found in Arabidopsis
-dr(osophila) masks repeats found in Drosophilas
-el(egans) masks repeats found in C. elegans


TransposonPSI

"identifies protein or nucleic acid sequence homology to proteins encoded by diverse families of transposable elements using PSI-blast"

Rather slow in sequential mode for larger genomes. Speedups easily obtainable by splitting the target genome/proteome and running it on multiple nodes and/or modifying transposonPSI.pl script by adding "-a number_of_cores" (number_of_cores -> number, i.e. 4) to two lines executing blast searches. GFF3 output.

De novo repeat library construction

For programs recommendations based on test see: Saha et al. Empirical comparison of ab initio repeat finding programs (2008)

For an extensive reviews listing tens of programs:

Keep in mind that resulting libraries should be further screened for gene families. There are border cases, where genome may contain thousands of modified copies of a gene, ranging from seemingly functional copies, through pseudogenes, gene fragments and single exons (i.e Speer family in rodents).

Consensus Based

One has to have at least a draft of the genome or multiple genomic sequences.

RepeatScout

command line only, requires compilation

Site: http://bix.ucsd.edu/repeatscout/

current version (2010-03): 1.05

Documentation:


Simplest run:

  • build frequency table
build_lmer_table -sequence input_genome_sequence.fas -freq output_lmer.frequency

output_lmer.frequency file can be still quite large (1.7Gb for 900Mb fasta file)

  • create fasta file containing all kinds of repeats
RepeatScout -sequence input_genome_sequence.fas -output output_repeats.fas  -freq output_lmer.frequency

Resources:

    • RAM usage (RepeatScout): > 17Gb for 800Mb genomic sequence.
    • 9.6h Xeon E7450 @ 2.40GHz

The output (output_repeats.fas) is a fasta file with headers (>R=1, >R=232 etc.). It contains also trivial simple repeats (CACACA...), tandem repeats

  • filter out short (<50bp) sequences. Remove "anything that is over 50% low-complexity vis a vis TRF or NSEG.". Perl script.

It does require trg and nseg to be on the PATH, or setting env variables TRF_COMMAND and NSEG_COMMAND pointing to their location

 
filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered_1 

this prints tons of messages


  • run RepeatMasker on your genome of interest using filtered RepeatScout library
 RepeatMasker  input_genome_sequence.fas -lib output_repeats.fas.filtered_1

This is a very long step (36h for 800Mb draft genome) when run in such default mode. See discussion for this page for possible, but so far untested speedups.

Output used for the next step: input_genome_sequence.fas.out

  • filtering putative repeats by copy number. By default only sequences occurring > 10 times in the genome are kept


 cat output_repeats.fas.filtered_1  | filter-stage-2.prl --cat=input_genome_sequence.fas.out > output_repeats.fas.filtered_2

Fast (< 1min ). You can modify the filter using i.e. "--thresh=20" (only repeats occurring 20+ times will be kept)


Piler (with lastz)

Prerequisites:

Also read the manual of pals, in case you need to understand the piler's gff input format: http://www.drive5.com/pals/pals_userguide.htm Both programs install easily on Ubuntu 9.10.

Input

  • fasta file. Fasta identifiers should not contain spaces/special characters (stay with alpha-numeric ones)
  • things to keep in mind:
    • default fasta output of i.e. Newbler needs to be fixed

Running lastz

lastz my_fasta_input.fa[multiple] my_fasta_input.fa --output=my_fasta_lastz_output.csv \ 
--format=general:name2,zstart2,end2,score,strand2,name1,zstart1,end1,identity  --notrivial --ambiguousn  --markend

Converting lastz output to gff piler input

Python script (still in the testing phase)

Running piler

Draft! See: http://www.drive5.com/piler/piler_userguide.html

piler -trs my_fasta_lastz_output.gff -out  my_fasta_lastz_output.trs

Input Reads

This is pre-assembly repeat finding method.

ReAS (broken install)

Paper

Installation

tar xfvz ReAS_2.02.tar.gz; cd ReAS_2.02/code
  • fix two files:

open code/N_matchreads.cpp and add line below i.e. after "#include<time.h>":

#include <cmath>

perltidy finds one error in Clustering.pl. Replace:

my $cluster_dir "cluster_"; #prefix of directory names

with:

my $cluster_dir = "cluster_"; #prefix of directory names


  • compile ReAS
make; make install

You will get binaries + perl modules in ReAS_2.02/bin

  • Put them on $PATH (bash)
export PATH=/your/path/to/ReAS_2.02/bin/:$PATH
  • you need to have on your PATH
    • dust from wublast, unknown version (not tested yet with dustmasker from blast)
    • muscle (3.6 was used originally)
    • cross_match.manyreads from phrap

With a large number of reads and "-pa 1" (see below) cross_match (version: 0.990329) crashes with: "FATAL ERROR: seq_area_size". Untested:

  • the newest beta version (1.090518) do not have this memory limit, but also do not have cross_match.manyreads(?)
  • in the older version of cross_match one can change #define BLOCK_SIZE 10000 in db.c and recompile
  • I decreased the input size for testing (265Mb fasta with 592k 454 reads) which is small enough. Possibly running ReAS with
reas_all.pl -read input.fasta -n 8 

may fix it.

Usage

read 00readme located in ReAS_2.02/code for more detailed instructions.

To run with default settings with already selected set of reads:

nohup reas_all.pl -read 454_subset_4repeat_search.fas -output 454_subset_4repeat_search.fas.cons_reas   2 > nohup.reas1.txt & 

CAVEAT: at the current stage above command does not run all scripts correctly.


external info

  • There is description how ReAS was used in Hydra magnipapillata genome project (see supplementary information PDF, section "S9. Analysis of repeated sequences"):

http://www.nature.com/nature/journal/v464/n7288/full/nature08830.html

From 10M+ Sanger sequences of average length 800+ bp they constructed two libraries:

  • A) 1% of reads, 17-mers with a depth of at least 10 ("especially enriched in the CR1 class of retrotransposons")
  • B) 10% of reads 17-mers depth range of 10 to 100.

""" - to improve the quality of the assembly, only reads that have at least 100 high-depth 17-mers were considered

- ReAS was run on each of the libraries separately

- after retaining only the assembled repeats of length larger than 500 nucleotides and a minimal average depth value (as provided by the program) of 10, the two libraries contained 949 and 25,110 repeats each, respectively.

- These sequences were then pulled together.

- initial ReAS assembly appears to be fragmented and there are many redundant sequences, the final version of the library was produced by running ReAS' join_fragments.pl and rmRedundance.pl scripts.

- final library contained 3909 reconstructed repetitive elements with an average length of about 1500 nt. """

  • see this thread @Biostar:

http://biostar.stackexchange.com/questions/753/instaling-reas-repeat-finding


For pages on similar topics visit: Wikiomics@OpenWetWare

De novo masking without library

These programs can mask sequence using only sequence itself.

Tandem Repeats Finder

web: http://tandem.bu.edu/trf/trf.html current version: 4.04

Options recommended for genome 2 genome alignments in: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2829014/

match = 2 mismatch = 5 delta = 5 PM = 80 PI = 10 minscore = 30 maxperiod = 200 -R

DustMasker

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/dustmasker/

WindowMasker

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/winmasker/

Others

  • seg from wu-blast

ribosomal RNA detection

While not "repetitive" in a strict sense, ribosomal sequences need to be detected and preferably masked for i.e. de novo gene prediction or ESTs mapping. There are in Genbank "protein coding genes" predicted from ribosomal sequences, not only as a result of de novo prediction but "prediction with a strong ESTs/RNA-Seq support".

RNAmmer

http://www.cbs.dtu.dk/services/RNAmmer/

Wildly used.

Meta-RNA

http://tools.camera.calit2.net/camera/meta_rna/


References

  • Church DM, Goodstadt L, Hillier LW, Zody MC, Goldstein S, et al. 2009 Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse. PLoS Biol 7(5): e1000112. doi:10.1371/journal.pbio.1000112
Personal tools