Wikiomics:Bioinfo tutorial

Wikiomics:Bioinfo tutorial (exercises)

Introduction
The goal of this document is to provide a set of illustrative cookbook recipes for wet lab biologists. As there is more than one way of performing a given task, it does point to different tools, but it does not even try to become a yet another directory of links.

You are strongly encouraged to post your comments/ideas on the discussion page of this tutorial.

Before you start

 * use text files/plain e-mail whenever possible
 * give meaningful names to your files (i.e. humP53_XP12345.fas)
 * do not use spaces or Unix special characters in file names (!?"'%&^~*$|/\{}[]<>:)
 * create separate folders/directories for each project with meaningful names
 * do not waste paper for printing out your blast search results or sequences
 * if primary data needs to be archived burn it on CD using (if possible) non vendor-specific data format

Operating systems/General software
For installing standalone bioinformatics programs the order of preference is: Linux/Unix > MacOS > Windows

Once you move from worksation to bioinformatics application server (almost always Linux/Unix) then as long as you have a Xwindows connection (preferably ssh tunneled) operating system of client workstation plays a minor (if any) role.

Sequence formats
There are multiple sequence formats commonly used. These can be divided into text-only and binary formats.

Unaligned sequences

 * Raw sequence ( example) The text file contains nothing but sequence. The information that's inside has to be encoded in the name of the file. Some web servers still require it.


 * FASTA (description, example) The most commonly used format (by programs/servers). It can contain either single or multiple sequences, each having a name line, also known as the header, starting with ">" in the first column. Keep in mind that various programs/servers introduce extra restrictions how the FastA header should look like. The "safe" standard:
 * Alphanumeric characters only [a-z, A-Z, 0-9], avoid spaces!
 * Keep the names short [i.e. 30 characters are significant for clustalw, only 8 for older versions of t_coffee]
 * In multiple FastA format keep the names unique

Whenever you plan to analyze your sequence using web servers or command line programs, keep it as FASTA. It is easy to cut the raw sequence out of it, and if your file name/sequence names contain accession numbers to find the GenBank entry.


 * GenBank format of choice used by GenBank to display extra information about the sequence. It is fairly complex, but easily readable by humans. Keep your sequences as GenBank when you need extra annotations provided by this format.


 * GCG GCG package own format. If you need to use GCG, you will have to convert your sequences into it


 * XML eXtended Markeup Language. Standard language slowly taking over the Net. XML files are computer-readable (= information is fairly easy to extract from them) and possible to read by humans. Format of choice when complex sequence information need to be described and computer accessible.


 * EXP format used by Wikiomics:Staden sequencing package. Contains both sequence and annotations understandable by Wikiomics:gap4.


 * GFF format for (mostly) genomic sequence annotations used by Ensembl

Aligned sequences

 * MSF Multiple Sequence Format. Used as output for multiple sequence alignments (pileup from GCG) and as input for alignment editors and shadowing programs.


 * ALN standard for multiple sequence alignments. Output format of clustalw and t_coffee.


 * PHYLIP used as an input in such programs like PHYLIP package, phyml


 * NEXUS advanced format allowing more annotation, used by MrBayes

Binary

 * ABI binary output format of ABI sequencers. Bulky (= unnecessary large), but so far the only one readable on Mac-OS. Can be easily converted into SCF.


 * SCF Sequence Common Format. Standard output for ALF sequencers. More compact than ABI, and commonly used in sequencing projects. Readable on PC and Unix, binary format of the Staden package.


 * ZTR successor of SCF with internal data compression. Currently the default binary format for Staden.

Sequence conversion tools
OK for simple format, has problems with aligned sequences A function in Sean Eddy's SQUID package for converting between unaligned and aligned sequence formats. can be used for sequence conversion, also for many other bioinformatic related functions.
 * readseq
 * sreformat from SQUID
 * BioPerl::SeqIO
 * for convertions between aligned formats (ALN, MSF, NEXUS, PHYLIP) use:
 * 1) SeaView GUI program
 * 2) ToPhylip (WWW) ALN and MSF -> Phylip

Processing ABI files and sequence assembly
Before you even start sequencing, plan the way the sequences will be processed. The most important is to establish a proper naming scheme compatible with various programs you (or others) may use downstream for sequence processing. The basic rules for file names are: prefix_my-template.my-primer_reaction-number
 * use alphanumeric characters [a-z, A-Z, 0-9]
 * use underscore "_" instead of spaces
 * full stop shall separate information about sequencing template and primer, like:

Even small sized projects become cumbersome if file names are not unique (you can not assembly files having the same name!). Once these requirements are satisfied, file name alone will give you enough information about:
 * whose sample is it?
 * to which sequencing project does it belong?
 * what was the sequencing template?
 * which primer was used?
 * was this sample sequenced before?

Fasta files (no traces)
phrap your_multiple_fasta_file.fa
 * phrap

Results in: your_multiple_fasta_file.fa.contigs

Used for i.e assembly of ESTs.

ABI/SCF files (traces)
recommended: see also another Wikiomics page Wikiomics:Staden
 * Staden package (Unix, Windows, OSX)

Takes a directory or a list of files for preprocessing (creating EXP files, clipping bad quality sequence, marking sequencing vector, repeats etc.
 * pregap4 (sequence pre-processing)

Creating sequence out of ABI chromatogram file is done by a basecalling program. One can go along with ABI-provided basecaller or preferably use:
 * phred  giving more accurate calls for less accurate part of the sequence (like at the end of the run, say 600bp and more) . Phred also gives a probability/quality values for each of the bases allowing more accurate assembly.

pregap4 %

gap4 & or (to open a specific database at startup): gap4 database.version.aux &
 * gap4 (sequence assembly, editing, verification and much more)

Other packages:
 * Consed Consed
 * Sequencher

Vector masking
Good for a single sequence.
 * EMVEC (web and blast2 based)


 * vectorstrip (Emboss)

vectorstrip -sequence my_fasta_file -vectors my_vectors_fasta_file -mismatch 30 -linkera TAGGATACAG -linkerb TTTTTCAGGACAGACA


 * linkera: 5' sequence
 * linkerb: 3' sequence


 * cross_match (from phrap package)

cross_match my_fasta_file my_vectors_fasta_file -minmatch 10 -minscore 20 -screen > screen.out

You will get a file: my_fasta_file.screen --- "devectorised" my_fasta_file

For masking out, let's say, one gene/overlapping clone, etc., type: cross_match fasta_file_to_be_masked masking_sequence -minmatch 10 -minscore 20 -screen > screen.out

You will get a file: fasta_file_to_be_masked.screen without the sequences matching masking_sequence.

A novel approach for finding putative vector/adapter sequences without knowledge of the vector sequence. "vector sequence is automatically inferred by analyzing the frequency of occurrence of short oligo-nucleotides using Poisson statistics" http://sourceforge.net/apps/mediawiki/amos/index.php?title=Figaro
 * Figaro

Masking repetitive sequences using RepeatMasker (WWW/Unix)

 * web site: http://www.repeatmasker.org/
 * current version (checked on 2010-03.22): 3.2.8
 * documentation: http://www.repeatmasker.org/webrepeatmaskerhelp.html


 * Online web server


 * command line

You have to have a FastA file (it can be multiple FastA). Type:

repmask your_sequence_in_fasta_format

You will get a file: your_sequence_in_fasta_format.masked --- name tells all

species options (choose only one): -m(us) masks rodent specific and mammalian wide repeats -rod(ent) same as -mus -mam(mal) masks repeats found in non-primate, non-rodent mammals -ar(abidopsis) masks repeats found in Arabidopsis -dr(osophila) masks repeats found in Drosophilas -el(egans) masks repeats found in C. elegans

Primer design
General info about primer design/PCR:
 * http://www.mcb.uct.ac.za/pcroptim.htm

online:
 * Primer3plus
 * BatchPrimer3 derivative program for designing multiple primers at once


 * Pythia new software claiming superiority over primer3


 * Oligo (PC & Mac ) $$$, first choice for small projects


 * JCVI Primer Designer pipeline primer design for DNA sequencing projects

Also search http://sourceforge.net/ with i.e. "primer pcr" for another, more task specific programs.


 * Primer Select from Lasergene (PC & Mac)

DNA -> protein translation

 * transeq (Emboss) WWW server
 * Protein Translate (ExPasy): WWW server
 * ORF Finder WWW server

ESTs based
Pre-clustered ESTs:
 * STACKdb (limited access, tissue specific splice forms)
 * Unigene (no consensus sequence)
 * TIGR Gene Indexes

Genome based

 * based on homology
 * de novo

see also separate page: Wikiomics:Cloning_in_silico

Genome annotation using ESTs assembly

 * PASA tool used in many current genome annotation projects

Synthetic genes
Many genes due to an extreme codon bias can not be expressed in model organisms, such as E.coli. To fix this problem one can back-translate protein sequence into nucleotides, using codons preferably used by E.coli. Moreover one can construct sucgh gene from a set of overlapping oligos, introducing restriction sites or other features at will. Programs performing such tasks:


 * JCAT http://www.jcat.de/
 * GeneDesign
 * SGD Syntetic Gene Designer
 * Synthetic Gene DataBase
 * stand alone program UpGene (SmallTalk) http://www.vectorcore.pitt.edu/upgene/upgene.html

http://www.microbialcellfactories.com/content/4/1/33
 * Codon preference in Pichia pastoris

Applied Microbiology and Biotechnology Issue: Volume 67, Number 3 Date: May 2005 Pages: 289 - 298 Strategies for efficient production of heterologous proteins in Escherichia coli S. Jana1 and J. K. Deb
 * Ref

Codon usage
There is a bias in codon usage resulting in variation in translation speed/efficiency. The bias itself varies between organisms (i.e. it is strong in yeast, then in E.coli, much weaker in B.subtilis).


 * codon usage database
 * graphical codon usage analysis http://gcua.schoedl.de/

One can calculate a number of codon bias indexes using programs:
 * CodonW web server, Sourceforge project


 * E-CAI HTML "The E-CAI server provides a direct threshold value for discerning whether the differences in CAI are statistically significant or whether they are merely artifacts that arise from internal biases in the G+C composition and/or amino acid composition of the query sequences."

RNA
The Vienna RNA package has solutions for many classical RNA problems. It well worth taking some time to familiarize oneself with the algorithms in this package before checking out the alternative methods.

Secondary structure prediction
For Wikiomics:RNA secondary structure prediction from a single sequence the most commonly used tools are MFold and RNAfold. For more single sequence and other comparative methods a large range of methods are available.

For the best accuracy (and evolutionary information) comparative approaches, in particular the alignment folding methods, have proven rather successful. Typically a researcher will align sequences using tools such as ClustalW, MAFFT and ProAlign and then fold the alignment using tools such as RNAalifold or Pfold.

RNA homology search
For the Wikiomics:RNA homology search problem (including secondary structure) a number of algorithms exist. E.g. Infernal and RaveNnA. A number of tools specializing in a specific RNA family are also available eg. tRNA, miRNA etc.

RNA gene-finding
Tools for Wikiomics:RNA gene-finding usually use as input sequence alignmments which are then scanned for potential conserved secondary structure signals. Useful examples of these are QRNA, RNAz and EvoFold.

RNA databases
A number of Wikiomics:RNA databases are available.
 * microRNA database (sequences, targets, sumbmission for publications)
 * Rfam a large collection of multiple sequence alignments and covariance models covering many common non-coding RNA families.
 * BRaliBase Benchmarks of RNA analysis tools on trusted ncRNA datasets.

Protein Alignment
See also article in wikipedia: Sequence alignment software

Two sequences
There are two kinds of alignment: local and global.


 * Local : finds the best region of similarity between two sequences
 * water (Emboss) uses the Smith-Waterman algorithm (modified for speed) web interface
 * lalign Huang and Miller algorithm
 * blast2
 * bestfit (GCG) Smith-Waterman algorithm


 * Global : covers the whole length of both sequences
 * needle (Emboss) Needleman-Wunsch global alignment
 * gap (GCG) Needleman-Wunsch global alignment

Multiple sequences
For a recent review see Notredame 2007 @PLOS

There is a trade off between accuracy and speed. For speed select MUSCLE. Accuracy of algorithms differs, with the best results on average being achieved using PRALINE and t-coffee/M-coffee. For poteins with multiple/repeated domains you may try graph based algorithms: poa or aba.

t_coffee your_fasta_file.fa T-Coffee has a limit of 50 sequences. The new version is called M-Coffee, it is a meta-aligner using output of several other algorithms (ClustalW, Poa, Muscle, ProbCons, MAFFT, Dialign-T, PCMA). To run it (bash): export POA_DIR=/absolute/path/to/Poa/installation/directory export DIALIGNT_DIR=/absolute/path/to/DIALIGN-T/conf t_coffee your_fasta_file.fa -special_mode mcoffee
 * t_coffee (good at producing alignments of distant (~30% similarity) sequences.


 * MAFFT one of the top alignment programs, with multiple options. See benchmarks from 2006 papers here

Uses PSI-BLAST for finding most conserved residues in each of the sequences. Very slow but could be more accurate as compared with other methods.
 * PRALINE (slow but takes into account blast results)

clustalw your_fasta_file.fa output as aln format
 * clustalw (standard algorithm most commonly used and "good enough" in most cases)

Improved, command line version is DIALIGN-T  dialign-t ./path/to/dialign-t/conf/ your_fasta_file.fa your_fasta_file.dialign-t output is in fasta format
 * dialign (good for creating alignments with long gaps as with non-orthologues sharing domains only)

muscle -clw -in your_fasta_file.fa -out your_fasta_file.aln -clw
 * MUSCLE http://www.ebi.ac.uk/muscle/

probcons -clustalw -ir 100 -pre 20 -a your_sequence_in_fasta_format.fa > probcons.aln
 * probcons http://probcons.stanford.edu/

Graph based algorithm replacing columns with directed acyclic graph. Permits domain recombinations of domains/exons: useful for the alignment of multidomain proteins and ESTs. poa -read_fasta your_fasta_file.fa -clustal output_file.aln /path/to/similarity/matrix/i.e/blosum80.mat
 * poa

Method using A-Bruijn graph. Good for alignment of proteins with: DNA: ABA is useful in the alignment of genomic sequences that contain duplications and inversions. Based on: http://www.genome.org/cgi/content/abstract/14/11/2336
 * aba
 * domains that are not present in all proteins
 * domains that are present in different orders in different proteins
 * domains that are present in multiple copies in some proteins


 * MEME (good for finding as-yet-unknown domains in a set of proteins) MEME web site

meme -nmotifs numberof_motifs_2_find -model oops -protein -sequence your_fasta_file.fa.fa -outfile your_fasta_file.fa.fa.meme Where 'model' could be oops/zoops/anr.

oops:   One Occurrence Per Sequence

zoops:  Zero or One Occurrence Per Sequence

anr: 	Any Number of Repetitions

MEME syntax (see more at http://meme.nbcr.net/meme4_3_0/documentation.html): meme your_input_fasta_file.fas        -nmotifs 20 -mod anr  -dna     > your_meme_output.htm &

Alignment accuracy assessment
all alignments have to in aligned fasta format and contain the same sequences in the same order.
 * MUMSA (WWW)

Paper: http://nar.oxfordjournals.org/cgi/content/full/33/22/7120

Alignment editors & beautifiers
good for sequence conversion (ALN, MSF, Fasta,Pfam, PIR, BLC) ALN only good for sequence conversion (ALN, MSF, Fasta, Mase, Phylip, Nexus)
 * Jalview (Java)
 * clustalx (PC, Unix)
 * seaview (Linux)


 * genedoc (PC only) genedoc(PC only) very flexible color schemes

ALN, MSF
 * boxshade (no editing, pretty prints only)

LaTeX package. accepts ALN and MSF formats.
 * TeXshade


 * WAViS web server
 * accepts multi FASTA with Clustal, NEXUS, GCG, Phylip conversion
 * EPS SVG output

Sequence logos
MSA or matrix
 * enoLOGOS (first choice)

MSA
 * WebLogo (input as Fasta, ALN formats)

HMM (single HMM profile)
 * LogoMat-M http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-m.cgi

(pair-wise HMM profiles)
 * LogoMat-P http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-p.cgi

coding DNA alignment based on existing protein alignment
This is a method of constructing more accurate DNA alignment for downstream use in phylogenic analyses taking into account coding properties of given DNA sequences.


 * PAL2NAL webserver capable of dealing with stop codons7frame shifts, suitable also for pseudogenes


 * transAlign Perl script, requires Clustalw article
 * RevTrans
 * protal2dna


 * CodonAlign stand alone program

Genomic DNA against genomic DNA

 * MUMmer
 * slides
 * PipMaker aligns two DNA sequences
 * MultiPipMaker (web server) article Aligns 2-30 DNA sequences

GATA: a graphic alignment tool for comparative sequence analysis
 * GATA paper


 * AuberGene stand alone Python scripts. Requires blastz for pairs comparison.
 * SeqAn::T-Coffee multiple sequence alignment program. SeqAn::T-Coffee aligns amino acid, DNA and RNA sequences using a consistency-based progressive alignment algorithm on a graph of sequence segments.

Non-coding DNA sequences
Designed specially for non-coding DNA sequence Paper:
 * Sigma

Similarity plots (dot-plots)
For the explanation how it works look here


 * dotter(Unix, PC)

Create dot-plots (DNA-DNA, protein-protein and DNA-protein) between one or more sequences. Ideal for quick estimation of sequence similarity, gene organization (genomic DNA vs cDNA), domain finding for small/medium size sequences. One can also zoom an area of interest. Memory and CPU is the limit but one can compare 200kbp genomic DNA vs 6kbp cDNA in ca 2min on 2GHz Pentium 4.

dotter first_fasta_DNA.seq second_fasta_DNA.seq dotter first_fasta_AA.seq second_fasta_AA.seq dotter first_fasta_DNA.seq second_fasta_AA.seq Note that the order of DNA vs protein comparison matters. Try to keep sequence names within your multiple_fasta file short.


 * dottup info (Emboss, Unix/Web) web form


 * Gepard bacterial genome scale fast dotplots (Java)

Database searches (sequence similarity)
There is a tradeoff between finding all possible matches (like with Smith-Waterman) and speed. For most applications Blast (and Psi-Blast) is a first choice. For massive searches, where speed is essential (i.e. mapping all cDNAs to a mammalian genome) use blat

Regular expressions
Searches for short motifs, like [MC]-A-x-E(2)-{IE} (M or C followed by A followed by any followed by EE followed by anything but I or E)

search nucleotide searches fuzznuc -sequence your_sequence_for_searching.fasta -pattern 'GATTACCA' -mismatch number_of_mismatches_allowed -outfile output_file.txt search protein sequences fuzzpro -sequence your_sequence_for_searching.fasta -pattern 'MEGGADE' -mismatch number_of_mismatches_allowed -outfile output_file.txt
 * fuzznuc (from Emboss)
 * fuzzpro(from Emboss)

Regular expression search of a nucleotide sequence dreg -sequence your_sequence_for_searching.fasta -pattern '[CG](5)TG{A}N(1,5)C' -mismatch number_of_mismatches_allowed -outfile output_file.txt Regular expression search of a protein sequence preg -sequence your_sequence_for_searching.fasta -pattern '[MARE](2)YDY{AGWK}' -mismatch number_of_mismatches_allowed -outfile output_file.txt
 * dreg (from Emboss)-
 * preg (from Emboss)

For more help about options/switches in Emboss programs type: program_name -help -verbose [CG](5)TG{A}N(1,5)C
 * ScanProsite

corresponds to:

[C or G] (five times), T, G, {anything except A},1-5 of any, C

More about pattern syntax:

First tier algorithms
For the detailed BLAST tutorial see Wikiomics:BLAST_tutorial
 * Blast

Warning: There is a major change of blast standalone binaries. Info below is describing blastall based programs. While most of the options will work with new blast+, check the information here

Command line: check first if $BLASTDB and $BLASTMAT are set (it should display these names sans $ signs & pathways): env | grep BLAST

If you get a blank and these are not set in your '~/.ncbirc' file (tcsh): setenv $BLASTDB /path/to/directory/with/blast/databases/ setenv $BLASTMAT /path/to/your/director/blast/data or follow instructions for creating '~/.ncbirc' from here:

to create a database out of your fasta file: formatdb -i your_multiple_protein_file.fa -p T -V formatdb -i your_multiple_DNAseq_file.fa -p F -V For more options type 'formatdb -h'

blastall -p blastn -d ecoli_nn -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastn_out blastall -p blastp -d ecoli_aa -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastp_out
 * nucleotide search
 * protein search

finding similarity between two sequences/alignment + dotplot on the web f
 * Blast2

comparing protein sequence with DNA: bl2seq -p tblastn -F N -i your_peptide_sequence.fasta -j your_nucleotide_sequence.fasta
 * Blat
 * description: http://www.genome.org/cgi/content/full/12/4/656

blat /path/to/database your_query_in_fasta.fa [-ooc=11.ooc] output.psl -ooc=11.ooc option speeds up search by using a set of over-occuring 11-mers to create your database out of DNA sequence file: faToTwoBit input_fasta.fa [in2.fa in3.fa ...] output.2bit use output.2bit as a database in your blat searches.
 * web server: http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
 * command line

Slower but more sensitive

 * FastA
 * Smith-Waterman/MPsrch
 * only Protien queries. Emboss water can be used to search local nucleotide databases.
 * Scanps (SW variant)

HMM and profile searches

 * PSI-BLAST


 * HMM searches

hmmbuild your_ALIGNED_sequences.hmm your_ALIGNED_sequences.fasta hmmcalibrate your_ALIGNED_sequences.hmm hmmsearch your_ALIGNED_sequences.hmm protein_database.fasta
 * hmmer

in HMMER3 you have to specify the format, i.e. /hmmbuild --amino --informat afa --cpu 8 your_ALIGNED_sequences.hmm your_ALIGNED_sequences.fasta

Command line: mast your_meme_motifs.meme -d your_database.multiple_fasta
 * MEME MAST [ http://meme.sdsc.edu/]

Data mining
finding basic info about known gene


 * EntrezGene
 * Omim
 * GeneCards
 * HomoloGene

Phylogeny
$$$ very preliminary at the moment $$$


 * 1) select a set of sequences
 * 2) create alignment
 * 3) remove bad regions of the alignment/gaps (tree constructing programs deal badly with indels)
 * 4) for more than 12 sequences tha space of all possible tree is to large to search one-by-one -> heuristic methods are used

Methods
sensitive to the order of sequences.
 * maximum parsimony MP


 * distance methods
 * Neighbor joining NJ
 * UPGMA (Unweighted Pair Group Method with Arithmetic mean) assumes rate of mutation constant along branches of a tree. Less accurate for sequence data than NJ.


 * maximum likehood (computationally intense, but more accurate)

To check validity of a tree: After creating a tree from MSA bootstrapping program selects number of columns from MSA and calculates a tree. This process is repeated 100-1000 times and resulting trees are compared with previously created one. Branches present in more than 70% of bootstrap samples are believed to have a good support. These estimates are conservative
 * bootstrap

Probably overestimate confidence values
 * multiple samples from the posterior distribution (MrBayes)

Programs
Individual programs can be scripted. On unix (bash): proml << EOF your_aligned_proteins.phylip 1 S y EOF
 * PHYLYP (phylogeny standard package) see a guide to it:
 * create a new directory
 * put your_aligned_proteins.phylip there
 * start any text editor, modify content below to suit your options, save as i.e. run_proml_with_my_options.sh
 * 1) !/usr/bin/sh

source run_proml_with_my_options.sh Output written to file "outfile" Tree also written onto file "outtree" requires NEXUS format. To tun it on command line: 'mb'
 * run it
 * after completein the run it reports:
 * PHYML http://atgc.lirmm.fr/phyml/ (ML)
 * MrBayes http://mrbayes.csit.fsu.edu/ (Bayes / Markov chain Monte Carlo.)
 * RAxML-VI http://www.ics.forth.gr/~stamatak/index-Dateien/Page443.htm (ML for large phylogenetic trees)
 * TREE-PUZZLE http://www.tree-puzzle.de/ (ML)
 * Mesquite collection of modules for performing a wide range of analyses from various branches of evolutionary biology (e.g., phylogenetics, molecular evolution, population genetics, geometric morphometrics)

Molecular clock

 * BEAST (Linux, Win, Mac) Bayesian MCMC analysis of molecular sequences

Tree display
For the comprehensive list see: http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software


 * Phylodendron
 * njplot
 * forester


 * PhyloWidget novel (2008) program (standalone + web java widget)

One gene in one species
Almost 20 species have as for today (May 2006) separate entries in Ensembl. Possible searches include: blast, GeneBank accession number, gene symbol, chromosomal band, marker etc. To elucidate genomic organization (exons/introns) of a gene starting from a known gene name/accession number.


 * Ensembl (known transcript)
 * 1) go to Ensembl
 * 2) click on a link to an organism of interest say Human (Homo sapiens)
 * 3) on the top  of the page, "Search for:" (select "Gene" from pull down menu)
 * 4) put the gene symbol (i.e. p53), Search
 * 5) select from the list of hits Ensembl gene (click on the red link)
 * 6) you are in "Ensembl Gene Report"
 * 7) right to "Sequence Markup" click on " View genomic sequence for this gene with exons highlighted"
 * 8) the sequence (including SNPs) is displayed


 * Spidey (novel transcript)
 * 1) get cDNA sequence in FASTA format
 * 2) check it for repetitive sequences using RepeatMasker
 * 3) get genomic sequence containing the gene with 1-10kb flanks
 * 4) check the gene structure using dotter this is to detect cDNA/genomic DNA miss-assembly/gaps etc.
 * 5) go to http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/spideyweb.cgi
 * 6) fill in the form selecting the proper splice site matrix (so far: vertebrates, Drosophila, C. elegans, or plants)
 * 7) you get a table with genomic coordinates/mRNA coordinates, exon length, % of identity, number ofmismatches, gaps, precence or abcence of splice donor and acceptor sites

Very good at finding exons in the same species (i.e human cDNA/human genomic sequence), more problematic with cross-species comparisons (i.e. human cDNA, bovine genomic sequence -> most of the exons OK, with few bad predictions).

comand line only. Slow but good at finding small exons. So far has matrices for human (default), mouse, rice and C.elegans. exalin your_genomic_sequence.fa your_cDNA_sequence.fa Best for: comparing a protein sequence or hmm profile to a genomic/cDNA sequence from other organism Restrictions: fasta headers longer than 15 characters are truncated
 * exalin http://blast.wustl.edu/exalin/
 * wise2

One needs to set up env variable for wise, i.e bash: export WISECONFIGDIR=/usr/biosoft/packages/wise/wise2.2.3-rc7//wisecfg/

genewise your_peptide_sequence.single_fasta your_genomic_sequence.single_fasta genewise -hmmer your_hmm_profile.HMM your_genomic_sequence.single_fasta estwise your_peptide_sequence.single_fasta your_cDNA_sequence.single_fasta estwise -hmmer your_hmm_profile.HMM your_cDNA_sequence.single_fasta
 * peptide vs genomic DNA:
 * HMM profile vs genomic DNA:
 * peptide vs cDNA/EST
 * HMM profile vs cDNA/EST


 * '''GeneSeqer WWW server + stand alone program. Multiple species models.

standalone cDNA-to-genome alignment program
 *  Pairagon 


 * PALMA 2007 program stand alone claimed to outperform other algorithms. Parameter files for human, C.elegans, arabidopsis and drosophila (Nov 2007)

One gene in multiple species
Structured Visualization of Evolutionary Conserved Sequences. Provides colorful graphs depicting intron exon orientation with sizes.
 * SVC


 * 1) go to SVC
 * 2) put gene symbol (i.e. SYNGR3) or better Ensembl ID (i.e. ENSG00000127561) into 'Input ID'
 * 3) check species of interest in 'Search for orthologs in:'
 * 4) Click 'Submit' next to 'Input ID'

Bacterial genomics

 * SEED precomputed similarities, annotations using colocalization in operons. Very efective for annotation new genes.


 * Integr8 Ensembl-like interface to bacterial genomes


 * WebACT WebACT genome-2-genome comparisons

Promoter prediction
Predicting promoters based on a single sequence is rather unaccurate. Transcription factor (TF) recognition sequences are fairly short, therefore any sequence 1kbp in lenght will contain hundreds of such motifs, only a handful of which will be a true positives. There are two major methods used for finding relevant sites
 * phylogenetic fingerprinting (comparison of promoter regions between orthologues)
 * comparison of promoter regions of genes with similar expression profiles in the same species

We can look for either known TFs recognition sequences (quality of which depends on database used) or perform motif finding de novo, looking simply for any streach of DNA shared between species/co-expressed genes.

make sure that you got a real 5' end exon of the gene (a lot of longer genes in the databases may be missing the true 5'end). You verify quality of your gene of interest by: Interpretation of this one may be tricky, on the simplest level look for spliced matches with ORFs in the same orientation as your gene without clear stop codons.
 * before you start
 * comparing with orthologues ( Ensembl or Homologene) If close orhologues are longer on the protein level, look for the presumptive missing exon(s).
 * looking at the genomic context (other genes flanking yours)
 * blasting cDNAs 5' end against ESTs database looking for any ESTs extending your gene
 * blasting masked genomic sequence, from 5' end exon to the start/end of the flanking gene

After checking the 5' end of the gene, always look for repetitive sequences using RepeatMasker. It does not mean that say Alu/LINE1 can not be a part of the promoter. Just that you may concentrate on parts which are a bit more specific for the given gene.
 * find position of repeats

phylogenetic fingerprinting
comparisons of promoter regions of orthologous genes in other species
 * 1) obtain these using Ensembl
 * 2) use tools described below to find out shared TF binding sites/comon motifs

Caveats:
 * comparing closely related genomic sequences (human and chimp) is not stringent enough
 * comparing wildely different sequences (human and zebrafish) is too stringent for all but small set of genes (i.e. homeobox?)

comparing promoters of coexpressed genes in the same species

 * 1) use GEO (Gene Expression Omnibus)  to find gene(s) with simmilar expression
 * 2) get promoter sequences using Ensembl

Tools:

Shared TF binding sites
Quality depends on the underlying database of TF binding sites


 * Consite produces a set of transcription factors binding sites shared between two sequences
 * zpicture plus rVISTA
 * Conreal (multiple alignment algorithms)
 * Confac
 * Mulan multiple sequences
 * Footprinter

Shared DNA motifs only
For more in depth coverage see: Wikiomics:Sequence_motifs

This method will work even for unknown TF binding sites, but it may skip some more degenerate motifs.


 * MEME capable of discovering longer motifs, very sensitive
 * Improbizer takes into account position of the motifs
 * AlignACE Gibbs sampling algorithm
 * MDscan

It is important to use few complementary programs as well as allow predictions of several motifs per program. There is threshold when it comes to number of sequences used, above which there is no improvement in sensitivity.

Comparison of several programs: http://nar.oxfordjournals.org/cgi/content/full/33/15/4899#TBL1

Others:

Promoter Databases
 * Eukaryotic Promoter Database
 * Mammalian Promoter Database MPromDb
 * DoOP eukaryotic promoter sequences
 * DBTSS Database of Transcriptional Start Sites

Transcription Factiors Databases
 * JASPAR (open transcription factors database)
 * TESS(finding transcription factor binding sites within your sequence)

Be aware that simply listing all recognizable TF binding sites in a given stretch of DNA is almost of no value. Only 0.1% of the putative sites listed that way will be functional (Applied bioinformatics for the identification of regulatory elements WW Wasserman, A Sandelin - Nature Reviews Genetics, 2004)

Reviews & tutorials

 * gene regulation tools site very comprehensive


 * Other tutorial (you need an access to Nature Immunology):
 * PLOS Apr 2006 MacIsaac and Fraenkel article


 * For an extensive table of known programs used for finding motifs see

Transcription start finding
There are various clases of such programs starting with completa ab initio promoter finding using genomic sequence only, through combining it with ab initio gene fining, finally taking into account ESTs data and/or gene annotations. Most of the programs were evaluated on human or mammalian sequences and may not be suitable for other groups of organisms. Masking repeats using RepeatMasker improves performance of McPromoter(strongly) and FirstEF (mildly), whereas has no effect on Eponine and DragonGSF.

Programs

 * N-SCAN (standalone only derivative of TwinSCAN)
 * Eponine (mammalian)
 * FirstEF (human)
 * DragonGSF


 * McPromoter (fly, with older version for human)


 * PromH (promoters identification using orthologous genomic sequences)


 * ARTS (superior but not as Jan 2007 publicly available)
 * CoreBoost

Newest:
 * AMOSA
 * ProSOM (HTML)

Databases

 * EPD eukaryotic promoter database
 * DoOP eukaryotic promoter sequences


 * VISTA Enhancer Browser Human/vertebrates enhancers DB

Reviews
Genome Biology August 2006
 * Bajic et al. Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment
 * Bajic VB, Tan SL, Suzuki Y, Sugano S: Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004,

SNPs (Single Nucleotide Polymorphism)

 * NCBI SNP database http://www.ncbi.nlm.nih.gov/SNP/ (strongly recommended)
 * HGVbase Phenotype/Genotype database (funding expired?)
 * JSNP http://snp.ims.u-tokyo.ac.jp/ (some access problems)
 * CSHL http://snp.cshl.org/

Unigene

 * 1) go to Unigene (ESTs derived from the gene)
 * 2) search Unigene with gene symbol (i.e SYNGR1) or Unigene cluster (i.e. SYNGR1)
 * 3) look at Expression and distribution of ESTs
 * 4) click on Expression profile to see electronic Northern

Symatlas
Wide array of tissues and genes.
 * go to Symatlas
 * put your gene symbol, i.e. SYNGR1
 * click on a small icon left to human or mouse genes

in situ hybridisation images
http://genex.hgu.mrc.ac.uk/Emage/database/emageIntro.html
 * EMAGE

Atlas of in situ hybridization of mouse brain
 * Gensat
 * http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gensat

Atlas of in situ hybridization in mouse embryos. One can order hybridization of a gene of interest.
 * Genepaint
 * http://www.genepaint.org/Frameset.html

Digital Differential Display
http://www.ncbi.nlm.nih.gov/UniGene/info_ddd.shtml

Searching info about gene homologues in other databases

 * Mouse
 * C.elegans & co.
 * D. melanogaster
 * S. cerevisiae SGD@stanford
 * Proteome (multiple organisms) ($$$) (databases of S. cerevisiae, S.pombe and WormPB [C.elegans+]

Finding protein domains

 * Pfam
 * SMART
 * DART
 * InterPro

Phosphorylation

 * PhosphoMotif Finder
 * ScanSite
 * NetPhos

GPI-Anchors

 * http://mendel.imp.ac.at/gpi/gpi_server.html

Disulfide bond predictions

 * DISULFIND Newest


 * DiANNA
 * PreCys

Sequence/structure retrieval

 * Entrez
 * Batch Entrez (retrieving multiple DNA/protein entries at once):
 * SRS (complex queries )
 * PROMPT Protein Mapping and Comparison Tool

Protein localization and structure prediction

 * collection of links:


 * TarO queries multiple servers

Start with:

 * Evaluating eukaryotic secreted protein prediction


 * Localizome server Localizome combines prediction of Phobius, Pfam, LocaloDom (database of TM topologies and TM helix numbers of around 3,000 eukaryotic Pfam domains)
 * The maximum number of input proteins in a single submission is 500 protein.
 * The length of each sequence is limited to 6,000 residues.

Standard choices

 * Psort
 * TargetP

In evaluation (localization):

 * MultiLoc
 * LOCSVMPSI (localization, results by email) site down 2010-03-11
 * BaCelLo(max five fasta sequences)
 * ProteinProwler  (plant/nonplant and four localizations only)

transmembrane

 * Recent review: Punta, Marco, Lucy R. Forrest, Henry Bigelow, Andrew Kernytsky, Jinfeng Liu, and Burkhard Rost. “Membrane protein prediction methods.” Methods 41, no. 4 (April 2007): 460-474.


 * top servers were evaluated by Cuthbertson et al.( Protein Engineering Design and Selection 2005) full text:
 * SPLIT4 (first choice)
 * TMHMM-2.0
 * HMMTOP 2
 * TopPred


 * Other servers:
 * ConPred2 Consensus TM prediction (metaserver)
 * DAS-TM
 * Phobius (includes signal peptide predictions)
 * PRODIV-TMHMM ref: (results send by email, prediction of TM topology)
 * MemBrain (includes signal peptide predictions) claims to outperform several other top scoring TM predicting programs paper(HTML)

see also links here: and the Signal Peptide section

signal peptides
Accepts both single fasts seqence as well as aligned sequences in multiple fasta format (only the top sequence appears in the output).
 * SignalP
 * Phobius (predicts signal peptide, TM and orientation)
 * PolyPhobius Homology supported signal peptide predictions


 * Phobius author PhD thesis:

Other Programs:
 * SPEPlip Eukaryots/Prokaryots
 * TatP Twin-arginine signal peptide cleavage sites in bacteria Paper:
 * LipoP lipoprotein signal peptides in Gram-negative bacteria. Paper:

protein structure (no-3D)

 * Expasy
 * Coils (coiled coil prediction)
 * Protscale (various prediction based on AA sequence)
 * Jpred (multiple secondary structure predictions run in parallel)

Graphical plots of protein 2D structure
' 'One-step' data input procedure, including easy querying of the public databases and incorporation of output from various popular prediction tools available from the Web.' allows creating graphs with transmembrane regions, signal peptides, disulfide-bridhes, glicosylated residues etc.
 * TMRPres2D standalone Java program. Top choice:


 * TOPO2 web based

Protein structure databases/comparisons

 * PDB
 * NCBI
 * CATH
 * DALI
 * SCOP

Metabolic networks
$$ needs more work$$
 * Reactome
 * KEGG
 * BioCyc
 * Cyclone - provides an open source Java API for easier access to BioCyc.
 * RegulonDB E.coli K12 DB (operons/genes/regulatory elements)

Protein-protein interactions
Large scale methods such as yeast-two-hybrids (Y2H), immuno-co-precipiatation (IP) produce a large number of false positives. Only about 30-50% of interactions seem to be relevant. While newer improvements of these procedures tend to be more accurate, existing databases struggle to separate good quality data.


 * IntAct@EBI literature and user submission data.


 * DIP
 * STRING
 * BindingDB
 * BioGRID
 * Mammalian Protein-Protein Interaction Database @MIPS
 * STITCH Chemical-Protein Interactions

Restricted access ($$$ ?):
 * BIND / BONDplus (registration required)

Tools
 * APID Agile Protein Interaction DataAnalyzer

Papers
 * http://www.nature.com/nbt/journal/v22/n1/full/nbt924.html

Where to find more information

 * Bioinformatics Links Directory @UBiC


 * Approaches to Web Development for Bioinformatics
 * Bioinformatics Resources - Medical Computing .Net

Also there are relevant pages on Wikiomics web site:
 * Wikiomics:BLAST_tutorial


 * Wikiomics:Bioinformatics courses(external)


 * Wikiomics:SRS_tutorials
 * Wikiomics:Ensembl_tutorials


 * Wikiomics:Staden
 * Wikiomics:Cloning_in_silico


 * Wikiomics:ChIP-chip
 * Wikiomics:Protein mass spectrometry
 * Wikiomics:Sequence motifs


 * Wikiomics:NMR_Toolbox
 * Wikiomics:Crystallography_toolbox
 * Wikiomics:Docking_toolbox

Credits

 * Darek Kedra wrote this tutorial