Revision as of 02:45, 9 April 2010

This article is a stub. You can help OpenWetWare by expanding it.

For simplicity, this page covers repeat finding in eukaryotic genomic DNA only.

Repeat finding can be divided into two tasks, depending on whether a repeat library is available:

A) A library exists for the given species (or possibly for a closely related one)

or

B) you construct such a library de novo.

Task A is usually a prerequisite for genome annotation and even for BLAST searches. For newly sequenced genomes one should start with B (constructing a species-specific repeat library).

Detecting known repeats

Most commonly used: RepeatMasker

RepeatMasker

web site: http://www.repeatmasker.org/
current version (checked on 2010-03-22): 3.2.8
documentation: http://www.repeatmasker.org/webrepeatmaskerhelp.html

Online web server [1]

command line

You need a FASTA file (a multi-FASTA file is fine). Type:

repmask your_sequence_in_fasta_format

You will get a file named your_sequence_in_fasta_format.masked, which contains the masked sequence.

species options (choose only one):

-m(us) masks rodent specific and mammalian wide repeats
-rod(ent) same as -mus
-mam(mal) masks repeats found in non-primate, non-rodent mammals
-ar(abidopsis) masks repeats found in Arabidopsis
-dr(osophila) masks repeats found in Drosophilas
-el(egans) masks repeats found in C. elegans
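
Once a .masked file exists, a quick sanity check is to see how much of the sequence was actually masked. The sketch below is a hypothetical helper (masked_fraction is not part of RepeatMasker) and assumes masked bases were replaced with N, so any N already present in the input (assembly gaps, for example) is counted as well.

```shell
# Hypothetical helper, not part of RepeatMasker: report the number of
# N characters, the total number of residues, and their ratio.
# Assumption: masked bases appear as N (or n) in the FASTA file.
masked_fraction() {
    awk '
        /^>/ { next }                      # skip FASTA header lines
        {
            total  += length($0)           # all residues on this line
            masked += gsub(/[Nn]/, "")     # gsub returns the number of N/n replaced
        }
        END {
            if (total == 0) { print "0 0 0.0000"; exit }
            printf "%d %d %.4f\n", masked, total, masked / total
        }
    ' "$@"
}
```

Called as masked_fraction your_sequence_in_fasta_format.masked, it prints three numbers: masked residues, total residues, and the masked fraction.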

De novo repeat library construction

For program recommendations based on tests, see: Saha et al., Empirical comparison of ab initio repeat finding programs (2008)

For an extensive review listing dozens of programs, see: Lerat E., Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs (Nov 2009)

Keep in mind that the resulting libraries should be further screened for gene families. There are borderline cases where a genome may contain thousands of modified copies of a gene, ranging from seemingly functional copies through pseudogenes and gene fragments to single exons (e.g. the Speer family in rodents).

Consensus Based

One has to have at least a draft of the genome or multiple genomic sequences.

RepeatScout

command line only, requires compilation

Site: http://bix.ucsd.edu/repeatscout/

current version (2010-03): 1.0.5

Documentation:

http://bix.ucsd.edu/repeatscout/readme.1.0.5.txt
PPT presentation of the algorithm: http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt
publication (PDF): De novo identification of repeat families in large genomes (2005)

prerequisites:
- Perl
- Tandem Repeats Finder (trf) (accessed 2010-03-22), last version: 4.04
- nseg

Simplest run:

build frequency table

build_lmer_table -sequence input_genome_sequence.fas -freq output_lmer.frequency

The output_lmer.frequency file can still be quite large (1.7 GB for a 900 MB FASTA file).
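
To make the idea of an l-mer frequency table concrete, here is a toy sketch that counts every overlapping l-mer in a FASTA file. It is only an illustration, not RepeatScout's implementation: lmer_table is a hypothetical name, the output format is made up, and the sketch ignores reverse complements and memory efficiency, so it is only suitable for small test inputs.

```shell
# Toy illustration only; build_lmer_table is the tool to use on real data.
# Usage: lmer_table L < input.fas
lmer_table() {
    awk -v l="$1" '
        # Count all overlapping l-mers of the sequence collected so far.
        function count_lmers(   i, n) {
            n = length(seq)
            for (i = 1; i + l - 1 <= n; i++)
                count[substr(seq, i, l)]++
            seq = ""
        }
        /^>/ { count_lmers(); next }       # new record: flush the previous one
        { seq = seq toupper($0) }          # join wrapped sequence lines
        END {
            count_lmers()
            for (k in count) print k "\t" count[k]
        }
    ' | sort
}
```

Because l-mers are counted per record, no l-mer spans the boundary between two sequences.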

create fasta file containing all kinds of repeats

RepeatScout -sequence input_genome_sequence.fas -output output_repeats.fas  -freq output_lmer.frequency

Resources:

- RAM usage (RepeatScout): > 17 GB for an 800 Mb genomic sequence.
- Run time: 9.6 h on a Xeon E7450 @ 2.40 GHz

The output (output_repeats.fas) is a FASTA file with headers such as >R=1, >R=232 etc. It also contains trivial simple repeats (CACACA...) and tandem repeats.

Filter out short (<50 bp) sequences and remove "anything that is over 50% low-complexity vis a vis TRF or NSEG". This is done with a Perl script, filter-stage-1.prl.

The script requires trf and nseg to be on the PATH, or the environment variables TRF_COMMAND and NSEG_COMMAND to point to their locations.

 
filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered_1

This prints a large number of messages.
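
The length part of this filter is easy to reproduce for checking purposes. The sketch below is a hypothetical helper (fasta_min_length is not part of RepeatScout) that keeps only records of at least a given length. It does not perform the low-complexity screening with TRF or NSEG, so it is not a replacement for filter-stage-1.prl.

```shell
# Hypothetical helper: keep FASTA records whose sequence is at least MIN long.
# Usage: fasta_min_length MIN < input.fas > output.fas
# Each kept sequence is written on a single line.
fasta_min_length() {
    awk -v min="$1" '
        # Print the record collected so far if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= min)
                print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }                   # join wrapped sequence lines
        END { emit() }
    '
}
```

For example, fasta_min_length 50 < output_repeats.fas shows which records pass the length cut-off alone.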

Run RepeatMasker on your genome of interest using the filtered RepeatScout library

 RepeatMasker  input_genome_sequence.fas -lib output_repeats.fas.filtered_1

This is a very long step when run in this default mode (36 h for an 800 Mb draft genome). See the discussion page for possible, but so far untested, speedups.

Output used for the next step: input_genome_sequence.fas.out

Filter putative repeats by copy number. By default, only sequences occurring > 10 times in the genome are kept

 cat output_repeats.fas.filtered_1  | filter-stage-2.prl --cat=input_genome_sequence.fas.out > output_repeats.fas.filtered_2

This step is fast (< 1 min). You can adjust the filter with, e.g., "--thresh=20" (only repeats occurring 20+ times will be kept).
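
Before choosing a threshold it can help to look at the copy numbers themselves. The following sketch tallies hits per repeat name from a RepeatMasker .out file, assuming the usual whitespace-separated layout in which data rows start with the integer Smith-Waterman score and the matching repeat name is the 10th column. repeat_copy_counts is a hypothetical helper, and its counting rule may differ from the one filter-stage-2.prl applies.

```shell
# Hypothetical helper: count RepeatMasker hits per repeat name.
# Assumption: data rows start with an integer score and column 10 holds
# the name of the matching repeat; header and blank lines are skipped.
repeat_copy_counts() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 14 { count[$10]++ }
        END { for (name in count) print count[name] "\t" name }
    ' "$@" | sort -k1,1nr -k2,2
}
```

Running repeat_copy_counts input_genome_sequence.fas.out lists repeat families from most to least frequent, one "count, name" pair per line.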

Input Reads

This is a pre-assembly repeat-finding method.

ReAS

Paper

Installation

get ftp://ftp.genomics.org.cn/pub/ReAS/software/ReAS_2.02.tar.gz
unpack it in a suitable directory

tar xfvz ReAS_2.02.tar.gz; cd ReAS_2.02/code

open N_matchreads.cpp and add the line below, e.g. after "#include<time.h>":

#include <cmath>

compile ReAS

make; make install

You will get binaries and Perl modules in ReAS_2.02/bin

Put them on $PATH (bash)

export PATH=/your/path/to/ReAS_2.02/bin/:$PATH

Usage

For the time being, read the 00readme file located in ReAS_2.02/code.

For pages on similar topics visit: Wikiomics@OpenWetWare

References

Church DM, Goodstadt L, Hillier LW, Zody MC, Goldstein S, et al. 2009 Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse. PLoS Biol 7(5): e1000112. doi:10.1371/journal.pbio.1000112


Wikiomics:Repeat finding: Difference between revisions
