Wikiomics:RNA-Seq: Difference between revisions
Darek Kedra (talk | contribs) |
Revision as of 06:35, 9 November 2012
This list is intended mostly for de novo splice site / transcript / gene prediction in newly sequenced genomes. At the same time tools listed below often are used in other pipelines such as transcript quantification or SNP discovery.
Mappers
Spliced Mappers (tested)
Tophat and Cufflinks
- Tophat
web: http://tophat.cbcb.umd.edu/
current version: 2.0.4 released 2012.06.21
active development: yes
platforms
- Linux x86_64 binary
- Mac OS X x86_64 binary
- source code
requirements: SAMtools (http://samtools.sourceforge.net/)
base mapper: bowtie2/bowtie
input: fastq, fasta
output: BAM
Currently the most widely used program for RNA-Seq mapping. Output is often processed with Cufflinks. The latest version supports detection of insertions and deletions from RNA-Seq data. Supports mixed-length reads (suitable e.g. for 454 data).
- Cufflinks:
web: http://cufflinks.cbcb.umd.edu/
current version: 2.0.1 released 2012.06.15
GMAP/GSNAP
http://research-pub.gene.com/gmap/
current version: 2012-11-07
active development: yes
FASTA and FASTQ input, support for paired ends, gzipped FASTQ files, circular chromosomes
To compile:
./configure --prefix=/some/path/for_install/ --with-gmapdb=/path_to/gmap_DBs/ --with-samtools=/path_to/samtools_0.1.18/
make
make install
To run:
gsnap --dir=genome_directory --db=genome_database --batch=2 --localsplicedist=10000 --nthreads=4 --nofails my_reads.fasta
Check "gsnap --help" for more detailed options
GEM
web: http://algorithms.cnag.cat/wiki/The_GEM_library
current version: GEM-binaries-Linux-x86_64-core_i3-20121106-022124.tbz2 from 2012-11-06
active development: yes
base mapper: gem-mapper and gem-split-mapper
Developed in OCaml and Python. Two-step mapping (unspliced mode first, then unmapped reads are mapped with splicing).
Article: http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2221.html
HMMSplicer
http://derisilab.ucsf.edu/index.php?software=105
current version: 0.9.5 from 2010.11.25
active development: maybe (no new release in 2 years)
base mapper: bowtie
input: fastq (converts quality values to phred scale)
output: bed file of junctions
Developed in Python. Requirements:
- OS: tested on MacOS X (authors), Linux Fedora 8,
- Python 2.6 (tested with 2.6.4)
- numpy (tested by authors with version 1.3.0)
- bowtie (works with 0.12.7)
Also completes running example with Python 2.7.1rc1, numpy-1.5.1 and bowtie 0.12.7 on in-house data.
Basic command:
python runHMM.py -o output_dir -i input_RNA-seq_data.qseq -q quality_type -g genome4mapping -j min_intron_size -k max_intron_size -p number_of_processors_to_use
type: python runHMM.py --help for more explanation
Tip: you can map your reads first in a non-spliced mode with a mapper of your choice, filter out all mapped reads and feed HMMsplicer with just unmapped reads.
Caveat: due to training process you have to use reads of the same length.
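The same-length caveat above can be checked before running HMMSplicer. What follows is only a minimal sketch, not part of HMMSplicer: the function names are hypothetical, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```shell
# Hypothetical helper: print "<read length> <number of reads>" for a FASTQ file.
# Assumes 4 lines per record; the sequence is line 2 of each record.
fastq_length_histogram() {
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' "$1" | sort -n
}

# Hypothetical helper: keep only reads of exactly LEN bases.
# Usage: fastq_keep_length LEN in.fastq > out.fastq
fastq_keep_length() {
    awk -v len="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (length(seq) == len) print header "\n" seq "\n" plus "\n" $0 }
    ' "$2"
}
```

If the histogram shows more than one length, filtering to the most common length (or trimming with a tool of your choice) gives input of a single length.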
SOAPsplice
http://soap.genomics.org.cn/soapsplice.html
current version: 1.9 from 2011.02.23
active development: yes
The SOAPsplice website provides detailed information on how to install and run it, plus performance evaluation data.
SOLiD data only
(untested)
SplitSeek
http://solidsoftwaretools.com/gf/project/splitseek/ (warning: dead link. solidsoftwaretools.com closed)
current version: 1.3.4
Ameur A, Wetterbom A, Feuk L, Gyllensten U. Global and unbiased detection of splice junctions from RNA-seq data. Genome Biol. 2010 Mar 17;11(3):R34.
Developed in Perl.
It requires AB WT Analysis Pipeline http://solidsoftwaretools.com/gf/project/transcriptome/ which breaks while compiling out of the box with gcc 4.4.x.
Solution:
- edit ./readsmap/src/simu.cxx file
- replace line 27:
char *s = strchr(seq, '.');
with this one:
const char *s = strchr(seq, '.');
Solved by Paolo Di Tommaso from CRG.
X-MATE
http://grimmond.imb.uq.edu.au/X-MATE/
current version: Oct 2010?
written in: perl (with some optional python, Java and C++)
Requires junction libraries (available from X-MATE web site for human and mouse).
Spliced Mappers (in development)
Olego
web: http://ngs-olego.sourceforge.net/
PALMapper (fusion of GenomeMapper & QPALMA)
web: http://raetschlab.org//suppl/palmapper
current version: palmapper-0.4.tar.gz 2011.07.04
active development: yes(?)
Simple installation: run "make" in the installation directory (tested on Debian 6.0 with gcc). To check the install, go to "testcase" and run "make" again. This requires a fast Internet connection, as it downloads genome and FASTQ files.
Mapsplice
web: http://www.netlab.uky.edu/p/bioinfo/MapSplice/
current version: MapSplice 1.15.2 from 2011.4.12
active development: maybe (no new release in a year)
base mapper: bowtie (new version finally supports bowtie 0.12.7)
Caveat: the reference genome has to be provided per chromosome (one fasta file per chromosome).
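If the genome comes as a single multi-FASTA file, it can be split into one file per sequence with awk. This is a minimal sketch: the function name and the output naming scheme (first word of the header plus ".fa") are assumptions for illustration, not something MapSplice prescribes.

```shell
# Hypothetical helper: write each sequence of a multi-FASTA file to its own file,
# named after the first word of the header (">chr1 some text" -> OUTDIR/chr1.fa).
# Sequence names containing "/" are not handled.
# Usage: split_fasta_per_sequence genome.fa outdir
split_fasta_per_sequence() {
    mkdir -p "$2"
    awk -v outdir="$2" '
        /^>/ {
            if (out != "") close(out)      # avoid running out of open files
            name = substr($1, 2)           # header word without the leading ">"
            out = outdir "/" name ".fa"
        }
        out != "" { print > out }
    ' "$1"
}
```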
SpliceMap
http://www.stanford.edu/group/wonglab/SpliceMap/
current version: 3.3.5.2 2010.10.23
active development: maybe (no new release in 1.5 years)
base mapper (preferred): bowtie (others possible)
"Currently, only the canonical GT-AG splice sites are identified."
Requirements:
- 8GB minimum for human genome, 16GB recommended
- input formats: RAW, FASTQ or FASTA
- Read >= 50bp
Base mappers:
- Bowtie (preferred)
- others: SeqMap, Eland
Features: "Support for arbitrarily long uneven read lengths"
Alexa-Seq
http://www.alexaplatform.org/alexa_seq/downloads.htm
Malachi Griffith, Griffith OL, Mwenifumbo J, Morin RD, Goya R, Tang MJ, Hou YC, Pugh TJ, Robertson G, Chittaranjan S, Ally A, Asano JK, Chan SY, Li I, McDonald H, Teague K, Zhao Y, Zeng T, Delaney AD, Hirst M, Morin GB, Jones SJM, Tai IT, Marco A. Marra. Alternative expression analysis by RNA sequencing. Nature Methods. 2010 Oct;7(10):843-847.
version: ALEXA-Seq_v.1.17 from Feb 2012
Available configured virtual machines (for VMware) ver. 1.12
GMORSE
www: http://www.genoscope.cns.fr/externe/gmorse/
Proper name: G-Mo.R-Se
current version: gmorse_02mar2011.tar.gz 2011.03.02 (new version)
It was used for Vitis vinifera genome project (old version).
ERANGE
www: http://woldlab.caltech.edu/rnaseq/
current version: Interim Erange3.3 release plus 4.0a new version
base mapper: bowtie or blat
requirements:
- Python 2.5+
- Cistematic 3.0 from http://cistematic.caltech.edu
- Cistematic version of the genomes, unless providing your own custom genome and annotations.
- You will need genomic sequences to build the expanded genome, as well as gene models from UCSC.
- (Optional) Python is very slow on large datasets. Use of the psyco module (psyco.sf.net) on 32-bit Linux or all Mac Intel machines to significantly speed up runtime is highly recommended.
- (Optional) Several of the plotting scripts also rely on Matplotlib,
which is available at matplotlib.sf.net.
TAU
http://mocklerlab-tools.cgrb.oregonstate.edu/TAU.html
current version: 1.4 2010.09.06
Transcriptome Assembly Utility: requires already mapped input. Compatible mappers: Blat, Eland and HashMatch. Also accepts gff3 files.
Not spliced
Mapping short reads to draft genome sequence with multiple contigs poses problems for current spliced mappers.
blat
http://genome.ucsc.edu/FAQ/FAQblat.html
Detailed description: http://genome.ucsc.edu/goldenPath/help/blatSpec.html
download from: http://users.soe.ucsc.edu/~kent/src/
last version: blatSrc34.zip 2007.04.20
Options used to produce hints for Augustus gene prediction program: (based on: http://augustus.gobics.de/binaries/readme.rnaseq.html)
blat -noHead -stepSize=5 -minIdentity=93 genome.masked.fa rnaseq.fa ali.psl
bahlerlab (Nature Protocols 200 Defining transcribed regions using RNA-seq. Brian T Wilhelm, Samuel Marguerat, Ian Goodhead & Jürg Bähler http://www.bahlerlab.info/docs/nprot.2009.229.pdf)
blat -noHead -out=psl -oneOff=1 -tileSize=8 FASTA_genome.txt FASTA_sequences.txt Output.bsl
Pash
Pash 3.0: A versatile software package for read mapping and integrative analysis of genomic and epigenomic variation using massively parallel DNA sequencing. BMC Bioinformatics. 2010 Nov 23;11(1):572 Authors: Coarfa C, Yu F, Miller CA, Chen Z, Harris RA, Milosavljevic A
Download: http://www.brl.bcm.tmc.edu/pash/pashDownload.rhtml
current version: 3.0.6.2
last
www: http://last.cbrc.jp/
Latest: last-262.zip from 2012.10.23
paper: http://genome.cshlp.org/content/21/3/487.full
requirements: min 2GB RAM/mammalian genome, 16-20GB recommended for optimal performance
Installation:
cd src; make
Creating genomic database and short reads mapping:
#db creation
lastdb -m1111110 -s20G -v my_genome_db my_genome.fasta
#mapping
lastal -Q3 -o reads_vs_my_genome_db.out -f 0 -v my_genome_db reads.fastq
where:
-Q3: fastq Illumina format
-f 0: output in tabulated format
-v: verbose (prints what it is doing)
last can map reads with indels and can truncate large parts of the reads (highly sensitive but with lower specificity). For example, it can report matches just 30 nucleotides long out of 54 nt long queries. The output needs to be filtered to remove spurious matches.
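One way to filter the tabulated output is by the fraction of the read that was aligned and by alignment score. This is a minimal sketch under stated assumptions: the function name and the thresholds are hypothetical, and the column layout (score in column 1, aligned read length in column 9, full read length in column 11, tab-separated, comment lines starting with "#") is assumed to match the tabular format of your last version, so check it against your own output first.

```shell
# Hypothetical filter for last tabular output.
# Keeps alignments with score >= MIN_SCORE that cover at least MIN_FRACTION of the read.
# Assumed columns: 1 = score, 9 = aligned length of the read, 11 = total read length.
# Usage: filter_last_tab MIN_FRACTION MIN_SCORE in.out > filtered.out
filter_last_tab() {
    awk -v minfrac="$1" -v minscore="$2" '
        BEGIN { FS = "\t" }
        /^#/ { print; next }                                  # keep comment lines
        NF >= 11 && $1 >= minscore && $9 / $11 >= minfrac     # print passing alignments
    ' "$3"
}
```

For example, `filter_last_tab 0.9 50 reads_vs_my_genome_db.out` would drop a 30 nt match from a 54 nt read; suitable thresholds depend on the data.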
It does not have a multiple-processor option, so for faster mapping one has to split the fastq file(s), run last in parallel and combine the results (or use Hadoop).
Since version 149 it is possible to get SAM output by a two-step procedure:
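The split, run and combine approach can be sketched in plain shell. The function names and the chunk prefix are hypothetical; the lastal options are the ones from the mapping command above and may need adjusting for other versions. The sketch starts one lastal process per chunk, so choose the chunk size to match the number of available cores.

```shell
# Hypothetical helper: split a FASTQ file into chunks of READS_PER_CHUNK reads.
# The chunk size in lines is a multiple of 4, so no record is cut in half
# (assumes plain 4-line FASTQ records).
# Usage: split_fastq reads.fastq READS_PER_CHUNK chunk_prefix
split_fastq() {
    split -l "$(( $2 * 4 ))" "$1" "$3"
}

# Hypothetical helper: map every chunk in the background, wait, then concatenate.
# Usage: run_last_chunks my_genome_db chunk_prefix combined.out
run_last_chunks() {
    for chunk in "$2"*; do
        lastal -Q3 -f 0 -o "$chunk.out" "$1" "$chunk" &
    done
    wait                           # block until all background lastal jobs finish
    cat "$2"*.out > "$3"           # comment/header lines are repeated once per chunk
}
```

A typical call would be `split_fastq reads.fastq 1000000 chunk_` followed by `run_last_chunks my_genome_db chunk_ reads_vs_my_genome_db.out`.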
#get MAF output first
lastal -Q3 -o reads_vs_my_genome_db.maf -f 1 -v my_genome_db reads.fastq
#convert MAF to SAM using maf-convert.py from scripts directory
maf-convert.py sam reads_vs_my_genome_db.maf > reads_vs_my_genome_db.sam
To convert it to bam format use samtools:
samtools view -but my_genome.fasta.fai -o reads_vs_my_genome_db.unsorted.bam reads_vs_my_genome_db.sam
samtools sort reads_vs_my_genome_db.unsorted.bam reads_vs_my_genome_db.sorted
samtools index reads_vs_my_genome_db.sorted.bam
(tested with samtools ver 0.1.13)
Caveat: default options result in mapping homopolymeric runs, simple repeats etc. to multiple genome positions. To avoid this, the genome should be soft-masked first.
Obsolete / not available
RNA-mate
http://solidsoftwaretools.com/gf/project/rnamate (dead link)
current version: 1.01
No activity since 2009. Successor: X-MATE
SOAPals
http://soap.genomics.org.cn/soapals.html
current version: 1.1 , 05-05-2010
The SOAPals website provides exact information on how to install and run it.
Note 2011.03.01
It seems that SOAPals has been replaced by SOAPsplice, and SOAPals is not available anymore.
SAW (method, no software yet)
Ning K, Fermin D (2010) SAW: A Method to Identify Splicing Events from RNA-Seq Data Based on Splicing Fingerprints. PLoS ONE 5(8): e12047. doi:10.1371/journal.pone.0012047