Short read toolbox

=Short Read Toolbox=

This page lists resources for working with high-throughput sequencing data. You can also check our individual lab pages, Cronn Lab or Liston:Lab, for updates on methods.

=Short Read Workshop=
Download presentations and training modules from our recent short read workshop, "An introduction to next-generation sequencing", presented at the Botany 2010 meeting in Providence, R.I.


 * [[media:Botany2010_workshop_agenda.pdf| Meeting Agenda]]
 * [[media:Botany2010_workshop_summaries.pdf| Presentation Summaries and Suggested Reading]]
 * [[media:Botany2010_workshop_training.pdf| Example Module A: Assembling chloroplast genomes from short reads]]
 * [[media:Botany2010_workshop_training.pdf| Example Module B: Programs and data to download]]

=Platforms=
Currently available platforms:
 * Illumina - Illumina (formerly Solexa).
 * 454 - 454/Roche.
 * SOLiD - ABI by Life Technologies.

Anticipated technologies:
 * Ion Torrent Semiconductor - Ion Torrent.
 * SMRT - Pacific BioSciences.
 * Nanopore - Oxford Nanopore Technologies.

=Online short-read resources=
 * SEQanswers - Online forum for next generation sequencing.
 * SEQanswers software post - Post of software available for next generation sequence data.
 * SEQwiki - SEQ Answers wikilist of bioinformatic applications.
 * De novo tips - Blog on de novo assembly.
 * UCSC Bioinformatics - UC Santa Cruz's bioinformatics server.
 * Cipres - Cipres.
 * GMOD - Generic model organism database (GMOD) project collection of tools.
 * BioPieces - A collection of bioinformatic tools.
 * Illumina Manuals - username: guest, password: illumina

=Sequence format information=
 * Short Read Toolbox - Descriptions and examples of qseq, scarf, fastq and fasta formats. Includes scripts to translate these formats to the fastq format standard.
 * FASTQ - Wikipedia's FASTQ page.
 * FASTA - Wikipedia's FASTA page.
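
As a rough illustration of how the fastq and fasta formats relate, the sketch below reads four-line fastq records and writes them out as two-line fasta records. It is not one of the Short Read Toolbox scripts. The function names `read_fastq` and `fastq_to_fasta` are made up for this example, and the code assumes strictly four lines per record with no wrapped sequences or blank lines.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            break  # End of file.
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here.
        quality = handle.readline().rstrip("\n")
        # Drop the leading '@' from the identifier line.
        yield header.rstrip("\n")[1:], sequence, quality

def fastq_to_fasta(in_handle, out_handle):
    """Write each FASTQ record as a two-line FASTA record; return the record count."""
    count = 0
    for identifier, sequence, _quality in read_fastq(in_handle):
        out_handle.write(">" + identifier + "\n" + sequence + "\n")
        count += 1
    return count

# Small in-memory example using a made-up read.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
result = io.StringIO()
fastq_to_fasta(example, result)
print(result.getvalue(), end="")
```

The same functions accept ordinary file handles from `open()`, so they can be pointed at real files once the format assumptions have been checked.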

=Alignment format information=
 * SAMtools - SAMtools.
 * AMOS - AMOS.
 * UCSC - UCSC's faq on file formats.

=Short-read quality control software=
 * TileQC - Requires R, RMySQL and MySQL.
 * FastQC - A quality control tool for high throughput sequence data. A Java application.
 * Short Read Toolbox - Scripts for quality control of Illumina data.
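
To give a sense of what quality control tools work with, here is a minimal sketch that decodes a fastq quality string into Phred scores and applies a crude mean-quality filter. It is not taken from any of the tools listed above. The function names and the threshold of 20 are arbitrary choices for illustration. The default offset of 33 assumes Sanger-style encoding; older Illumina pipelines (1.3 to 1.7) used an offset of 64.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in quality]

def error_probability(score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)

def mean_quality(quality, offset=33):
    """Arithmetic mean of the Phred scores in one read (0.0 for an empty read)."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0

def passes_filter(quality, min_mean=20, offset=33):
    """Keep a read only if its mean Phred score reaches the (arbitrary) threshold."""
    return mean_quality(quality, offset) >= min_mean

# Made-up quality string for demonstration.
print(phred_scores("II5#"))
print(passes_filter("II5#"))
```

Averaging Phred scores is a simplification. Dedicated tools such as FastQC report per-position distributions instead of a single number per read.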

=Open source de novo assemblers=
 * Velvet - Implements de Bruijn graphs in C. Requires a 64-bit Linux OS.
 * Edena - 32- and 64-bit Linux.
 * ABySS - Multi-threaded de novo assembly.
 * Ray - Multi-threaded de novo assembly.
 * QSRA - Utilizes quality scores.
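
Velvet is described above as implementing de Bruijn graphs. The toy sketch below shows the basic idea only: each k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is not Velvet's implementation, and it ignores reverse complements, coverage, and error correction. The function name `de_bruijn_graph` and the example reads are invented for illustration.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Two made-up overlapping reads.
graph = de_bruijn_graph(["ACGTC", "CGTCA"], k=3)
for node in sorted(graph):
    print(node, "->", ",".join(sorted(graph[node])))
```

Walking the edges from `AC` spells out the sequence shared by the two overlapping reads, which is the intuition behind assembling contigs from such a graph.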

=Open source reference guided assemblers=
 * SOAP - Short Oligonucleotide Analysis Package.
 * MAQ - Mapping and Assembly with Qualities.
 * Bowtie - An ultrafast, memory-efficient short read aligner.
 * BWA - Burrows-Wheeler aligner.
 * RGA - Perl script which calls blat to assemble short reads.
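
BWA takes its name from the Burrows-Wheeler transform. For readers curious about the transform itself, the sketch below builds it naively from sorted rotations and then inverts it. This is a teaching example only. Real aligners use far more efficient index structures, and nothing here reflects BWA's actual code. The function names and the `$` terminator are assumptions of this example.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler transform of text, built from sorted rotations."""
    text = text + terminator  # The terminator must not occur in text.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and re-sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(char + row for char, row in zip(transformed, table))
    for row in table:
        if row.endswith(terminator):
            return row[:-1]  # Strip the terminator.
    raise ValueError("terminator not found in transformed text")

transformed = bwt("GATTACA")
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is why it lends itself to compact, searchable indexes of a reference sequence.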

=Hybrid assemblers (reference guided & de novo)=
 * YASRA - Yet Another Short Read Aligner.
 * Aakrosh Ratan dissertation - Description of YASRA.
 * Liston:Computer_Scripts - Scripts for post-processing of YASRA contigs.

=RNA-Seq / Transcriptome=
 * TopHat - A fast splice junction mapper for RNA-Seq reads.
 * Cufflinks - Assembles transcripts, estimates their abundances, and tests for differential expression and regulation.
 * SuperSplat - Splice junction discovery.

=Assembly viewers=
 * Tablet - Visualizes ACE, AFG, MAQ, SOAP, SAM and BAM formats.
 * SAMtools - SAMtools.

=Alignment programs=
 * MAFFT - MAFFT.
 * T-Coffee - T-Coffee.
 * Muscle - Muscle.
 * LASTZ - LASTZ, hosted at the Miller lab.
 * MUMmer - MUMmer.
 * Mulan - Multiple Sequence Alignment and Visualization Tool.
 * VISTA - Tools for Comparative Genomics.
 * Mauve - Multiple (bacterial) genome alignment.

=Sequence query programs=
 * BLAST - BLAST.
 * PLAN - A web application for conducting, organizing, and mining large-scale BLAST searches (limited to 1,000 queries).
 * BLAT - BLAT.

=Linux=
 * [[media:Essential_Linux.pdf | Essential Linux]]

=Perl=
A very brief example to demonstrate file input/output.

Code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (@temp, $in, $out);
my $inf  = "data.fq";
my $outf = "data_out.fq";

open($in,  "<", $inf)  or die "Can't open $inf: $!";
open($out, ">", $outf) or die "Can't open $outf: $!";

# Read one four-line fastq record per iteration and
# write it out as a single tab-delimited line.
while (<$in>) {
    chomp($temp[0] = $_);     # First line is an identifier.
    chomp($temp[1] = <$in>);  # Second line is sequence.
    chomp($temp[2] = <$in>);  # Third line is '+', optionally followed by the identifier.
    chomp($temp[3] = <$in>);  # Fourth line is quality.
    print $out join("\t", @temp) . "\n";
}

# Report the file name, not the filehandle, if closing fails.
close $in  or die "$inf: $!";
close $out or die "$outf: $!";
```

 * perlintro - Introduction to perl with links to other documentation.
 * BioPerl beginners - Introduction to BioPerl (be prepared for object oriented code).

=Python=
 * Python tutorial
 * Biopython
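
For comparison with the Perl file input/output example above, here is a rough Python counterpart that reads four-line fastq records and writes each as one tab-delimited line. It uses only the standard library, not Biopython. The function name `fastq_to_tab` is invented for this sketch, and the demonstration writes a made-up record to a temporary directory so that it runs without an existing `data.fq`.

```python
import os
import tempfile

def fastq_to_tab(in_path, out_path):
    """Write each four-line FASTQ record as one tab-delimited line; return the record count."""
    records = 0
    with open(in_path) as infile, open(out_path, "w") as outfile:
        while True:
            # Identifier, sequence, '+' line, quality.
            lines = [infile.readline().rstrip("\n") for _ in range(4)]
            if not lines[0]:
                break  # No identifier line left: end of file.
            outfile.write("\t".join(lines) + "\n")
            records += 1
    return records

# Demonstration on a temporary file holding one made-up record.
with tempfile.TemporaryDirectory() as tmp:
    inf = os.path.join(tmp, "data.fq")
    outf = os.path.join(tmp, "data_out.fq")
    with open(inf, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    print(fastq_to_tab(inf, outf))
    with open(outf) as handle:
        print(handle.read(), end="")
```

The `with` statements close both files automatically, which replaces the explicit `close ... or die` calls in the Perl version.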

=R project=
 * R project - Statistical programming environment.
 * Bioconductor - R for biologists (micro-array and next generation data).
 * APE - Analysis of phylogenetics and evolution R package.
 * HT Sequence Analysis with R and Bioconductor

=Useful links=
 * User:Brian J. Knaus
 * Cronn Lab
 * Liston Lab