Wikiomics:Bioinfo tutorial

Introduction

The goal of this document is to provide a set of illustrative cookbook recipes for wet lab biologists. As there is more than one way of performing a given task, it does point to different tools, but it does not even try to become a yet another directory of links.

You are strongly encouraged to post your comments/ideas on the discussion page of this tutorial.

Before you start

use text files/plain e-mail whenever possible
give meaningful names to your files (i.e. humP53_XP12345.fas)
do not use spaces or Unix special characters in file names (!?"'%&^~*$|/\{}[]()<>:)
create separate folders/directories for each project with meaningful names
do not waste paper for printing out your blast search results or sequences
if primary data needs to be archived burn it on CD using (if possible) non vendor-specific data format

Operating systems/General software

For installing standalone bioinformatics programs the order of preference is: Linux/Unix > MacOS > Windows

Once you move from worksation to bioinformatics application server (almost always Linux/Unix) then as long as you have a Xwindows connection (preferably ssh tunneled) operating system of client workstation plays a minor (if any) role.

DNA sequence processing

Sequence formats

There are multiple sequence formats commonly used. These can be divided into text-only and binary formats.

Text-only

Unaligned sequences

Raw sequence ( example) The text file contains nothing but sequence. The information that's inside has to be encoded in the name of the file. Some web servers still require it.

FASTA (description, example) The most commonly used format (by programs/servers). It can contain either single or multiple sequences, each having a name line, also known as the header, starting with ">" in the first column. Keep in mind that various programs/servers introduce extra restrictions how the FastA header should look like. The "safe" standard:
- Alphanumeric characters only [a-z, A-Z, 0-9], avoid spaces!
- Keep the names short [i.e. 30 characters are significant for clustalw, only 8 for older versions of t_coffee]
- In multiple FastA format keep the names unique

Whenever you plan to analyze your sequence using web servers or command line programs, keep it as FASTA. It is easy to cut the raw sequence out of it, and if your file name/sequence names contain accession numbers to find the GenBank entry.

GenBank format of choice used by GenBank to display extra information about the sequence. It is fairly complex, but easily readable by humans. Keep your sequences as GenBank when you need extra annotations provided by this format.

GCG GCG package own format. If you need to use GCG, you will have to convert your sequences into it

XML eXtended Markeup Language. Standard language slowly taking over the Net. XML files are computer-readable (= information is fairly easy to extract from them) and possible to read by humans. Format of choice when complex sequence information need to be described and computer accessible.

EXP format used by Wikiomics:Staden sequencing package. Contains both sequence and annotations understandable by Wikiomics:gap4.

GFF format for (mostly) genomic sequence annotations used by Ensembl [1]

Aligned sequences

MSF Multiple Sequence Format. Used as output for multiple sequence alignments (pileup from GCG) and as input for alignment editors and shadowing programs.

ALN standard for multiple sequence alignments. Output format of clustalw and t_coffee.

PHYLIP used as an input in such programs like PHYLIP package, phyml

NEXUS advanced format allowing more annotation, used by MrBayes

Binary

ABI binary output format of ABI sequencers. Bulky (= unnecessary large), but so far the only one readable on Mac-OS. Can be easily converted into SCF.

SCF Sequence Common Format. Standard output for ALF sequencers. More compact than ABI, and commonly used in sequencing projects. Readable on PC and Unix, binary format of the Staden package.

ZTR successor of SCF with internal data compression. Currently the default binary format for Staden.

Sequence conversion tools

readseq

OK for simple format, has problems with aligned sequences

sreformat from SQUID

A function in Sean Eddy's SQUID package for converting between unaligned and aligned sequence formats.

BioPerl::SeqIO

can be used for sequence conversion, also for many other bioinformatic related functions.

for convertions between aligned formats (ALN, MSF, NEXUS, PHYLIP) use:

SeaView GUI program
ToPhylip (WWW) ALN and MSF -> Phylip

Processing ABI files and sequence assembly

Before you even start sequencing, plan the way the sequences will be processed. The most important is to establish a proper naming scheme compatible with various programs you (or others) may use downstream for sequence processing. The basic rules for file names are:

use alphanumeric characters [a-z, A-Z, 0-9]
use underscore "_" instead of spaces
full stop shall separate information about sequencing template and primer, like:

prefix_my-template.my-primer_reaction-number

Even small sized projects become cumbersome if file names are not unique (you can not assembly files having the same name!). Once these requirements are satisfied, file name alone will give you enough information about:

whose sample is it?
to which sequencing project does it belong?
what was the sequencing template?
which primer was used?
was this sample sequenced before?

Programs for DNA sequence assembly

Fasta files (no traces)

phrap

phrap your_multiple_fasta_file.fa

Results in: your_multiple_fasta_file.fa.contigs

Used for i.e assembly of ESTs.

ABI/SCF files (traces)

recommended:

Staden package (Unix, Windows, OSX)

see also another Wikiomics page Wikiomics:Staden

- pregap4 (sequence pre-processing)

Takes a directory or a list of files for preprocessing (creating EXP files, clipping bad quality sequence, marking sequencing vector, repeats etc.

Creating sequence out of ABI chromatogram file is done by a basecalling program. One can go along with ABI-provided basecaller or preferably use:

phred [2] giving more accurate calls for less accurate part of the sequence (like at the end of the run, say 600bp and more) . Phred also gives a probability/quality values for each of the bases allowing more accurate assembly.

pregap4 %

- gap4 (sequence assembly, editing, verification and much more)

gap4 &

or (to open a specific database at startup):

gap4 database.version.aux &

Other packages:

Consed Consed
Sequencher [3]

Vector masking

EMVEC (web and blast2 based) [4]

Good for a single sequence.

vectorstrip (Emboss)

vectorstrip -sequence  my_fasta_file -vectors my_vectors_fasta_file -mismatch 30
   -linkera TAGGATACAG -linkerb TTTTTCAGGACAGACA

linkera: 5' sequence
linkerb: 3' sequence

cross_match (from phrap package)

cross_match my_fasta_file my_vectors_fasta_file -minmatch 10 -minscore 20 -screen > screen.out

You will get a file: my_fasta_file.screen --- "devectorised" my_fasta_file

For masking out, let's say, one gene/overlapping clone, etc., type:

cross_match fasta_file_to_be_masked masking_sequence -minmatch 10 -minscore 20 -screen > screen.out

You will get a file: fasta_file_to_be_masked.screen without the sequences matching masking_sequence.

Figaro

A novel approach for finding putative vector/adapter sequences without knowledge of the vector sequence. "vector sequence is automatically inferred by analyzing the frequency of occurrence of short oligo-nucleotides using Poisson statistics" http://sourceforge.net/apps/mediawiki/amos/index.php?title=Figaro

Masking repetitive sequences using RepeatMasker (WWW/Unix)

web site: http://www.repeatmasker.org/
current version (checked on 2010-03.22): 3.2.8
documentation: http://www.repeatmasker.org/webrepeatmaskerhelp.html

Online web server [5]

command line

You have to have a FastA file (it can be multiple FastA). Type:

repmask your_sequence_in_fasta_format

You will get a file: your_sequence_in_fasta_format.masked --- name tells all

species options (choose only one):

-m(us) masks rodent specific and mammalian wide repeats
-rod(ent) same as -mus
-mam(mal) masks repeats found in non-primate, non-rodent mammals
-ar(abidopsis) masks repeats found in Arabidopsis
-dr(osophila) masks repeats found in Drosophilas
-el(egans) masks repeats found in C. elegans

Primer design

General info about primer design/PCR:

http://www.mcb.uct.ac.za/pcroptim.htm

online:

Primer3plus
- BatchPrimer3 derivative program for designing multiple primers at once

Pythia new software claiming superiority over primer3

Oligo (PC & Mac ) $$$, first choice for small projects

JCVI Primer Designer pipeline primer design for DNA sequencing projects

Also search http://sourceforge.net/ with i.e. "primer pcr" for another, more task specific programs.

Primer Select from Lasergene (PC & Mac)

DNA -> protein translation

transeq (Emboss) WWW server
Protein Translate (ExPasy): WWW server
ORF Finder WWW server

Cloning in silico

ESTs based

Pre-clustered ESTs:

STACKdb (limited access, tissue specific splice forms) [6]
Unigene (no consensus sequence) [7]
TIGR Gene Indexes

Genome based

based on homology
de novo

see also separate page: Wikiomics:Cloning_in_silico

Genome annotation using ESTs assembly

PASA tool used in many current genome annotation projects

Synthetic genes

Many genes due to an extreme codon bias can not be expressed in model organisms, such as E.coli. To fix this problem one can back-translate protein sequence into nucleotides, using codons preferably used by E.coli. Moreover one can construct sucgh gene from a set of overlapping oligos, introducing restriction sites or other features at will. Programs performing such tasks:

JCAT http://www.jcat.de/
GeneDesign
SGD Syntetic Gene Designer
Synthetic Gene DataBase [8]
stand alone program UpGene (SmallTalk) http://www.vectorcore.pitt.edu/upgene/upgene.html

Codon preference in Pichia pastoris

http://www.microbialcellfactories.com/content/4/1/33

Ref

Applied Microbiology and Biotechnology
Issue:  Volume 67, Number 3
Date:  May 2005
Pages: 289 - 298
Strategies for efficient production of heterologous proteins in Escherichia coli
S. Jana1 and J. K. Deb

Codon usage

There is a bias in codon usage resulting in variation in translation speed/efficiency. The bias itself varies between organisms (i.e. it is strong in yeast, then in E.coli, much weaker in B.subtilis).

codon usage database [9]
graphical codon usage analysis http://gcua.schoedl.de/

One can calculate a number of codon bias indexes using programs:

CodonW web server, Sourceforge project

E-CAI HTML "The E-CAI server provides a direct threshold value for discerning whether the differences in CAI are statistically significant or whether they are merely artifacts that arise from internal biases in the G+C composition and/or amino acid composition of the query sequences."

RNA

The Vienna RNA package has solutions for many classical RNA problems. It well worth taking some time to familiarize oneself with the algorithms in this package before checking out the alternative methods.

Secondary structure prediction

For Wikiomics:RNA secondary structure prediction from a single sequence the most commonly used tools are MFold and RNAfold. For more single sequence and other comparative methods a large range of methods are available.

For the best accuracy (and evolutionary information) comparative approaches, in particular the alignment folding methods, have proven rather successful. Typically a researcher will align sequences using tools such as ClustalW, MAFFT and ProAlign and then fold the alignment using tools such as RNAalifold or Pfold.

RNA homology search

For the Wikiomics:RNA homology search problem (including secondary structure) a number of algorithms exist. E.g. Infernal and RaveNnA. A number of tools specializing in a specific RNA family are also available eg. tRNA, miRNA etc.

RNA gene-finding

Tools for Wikiomics:RNA gene-finding usually use as input sequence alignmments which are then scanned for potential conserved secondary structure signals. Useful examples of these are QRNA, RNAz and EvoFold.

RNA databases

A number of Wikiomics:RNA databases are available.

microRNA database (sequences, targets, sumbmission for publications)
Rfam a large collection of multiple sequence alignments and covariance models covering many common non-coding RNA families.
BRaliBase Benchmarks of RNA analysis tools on trusted ncRNA datasets.

Protein Alignment

See also article in wikipedia: Sequence alignment software

Two sequences

There are two kinds of alignment: local and global.

Local: finds the best region of similarity between two sequences

water (Emboss) uses the Smith-Waterman algorithm (modified for speed) web interface
lalign Huang and Miller algorithm [10]
blast2 [11]
bestfit (GCG) Smith-Waterman algorithm

Global: covers the whole length of both sequences

needle (Emboss) Needleman-Wunsch global alignment [12]
gap (GCG) Needleman-Wunsch global alignment

Multiple sequences

For a recent review see Notredame 2007 @PLOS

There is a trade off between accuracy and speed. For speed select MUSCLE. Accuracy of algorithms differs, with the best results on average being achieved using PRALINE and t-coffee/M-coffee. For poteins with multiple/repeated domains you may try graph based algorithms: poa or aba.

t_coffee (good at producing alignments of distant (~30% similarity) sequences. [13]

t_coffee your_fasta_file.fa

T-Coffee has a limit of 50 sequences. The new version is called M-Coffee, it is a meta-aligner using output of several other algorithms (ClustalW, Poa, Muscle, ProbCons, MAFFT, Dialign-T, PCMA). To run it (bash):

export POA_DIR=/absolute/path/to/Poa/installation/directory
export DIALIGNT_DIR=/absolute/path/to/DIALIGN-T/conf
t_coffee your_fasta_file.fa -special_mode mcoffee

MAFFT one of the top alignment programs, with multiple options. See benchmarks from 2006 papers here

PRALINE (slow but takes into account blast results) [14]

Uses PSI-BLAST for finding most conserved residues in each of the sequences. Very slow but could be more accurate as compared with other methods.

clustalw (standard algorithm most commonly used and "good enough" in most cases) [15]

clustalw your_fasta_file.fa

output as aln format

dialign (good for creating alignments with long gaps as with non-orthologues sharing domains only) [16]

Improved, command line version is DIALIGN-T [17]

dialign-t ./path/to/dialign-t/conf/ your_fasta_file.fa your_fasta_file.dialign-t

output is in fasta format

MUSCLE http://www.ebi.ac.uk/muscle/

muscle -clw -in your_fasta_file.fa -out your_fasta_file.aln -clw

probcons http://probcons.stanford.edu/

probcons -clustalw -ir 100 -pre 20 -a your_sequence_in_fasta_format.fa > probcons.aln

poa

Graph based algorithm replacing columns with directed acyclic graph. Permits domain recombinations of domains/exons: useful for the alignment of multidomain proteins and ESTs.

 poa -read_fasta your_fasta_file.fa  -clustal output_file.aln /path/to/similarity/matrix/i.e/blosum80.mat

aba

Method using A-Bruijn graph. Good for alignment of proteins with:

- domains that are not present in all proteins
- domains that are present in different orders in different proteins
- domains that are present in multiple copies in some proteins

DNA: ABA is useful in the alignment of genomic sequences that contain duplications and inversions. Based on: http://www.genome.org/cgi/content/abstract/14/11/2336

MEME (good for finding as-yet-unknown domains in a set of proteins) MEME web site

meme -nmotifs numberof_motifs_2_find -model oops  -protein -sequence your_fasta_file.fa.fa -outfile your_fasta_file.fa.fa.meme

Where 'model' could be oops/zoops/anr.

oops: One Occurrence Per Sequence

zoops: Zero or One Occurrence Per Sequence

anr: Any Number of Repetitions

MEME syntax (see more at http://meme.nbcr.net/meme4_3_0/documentation.html):

meme your_input_fasta_file.fas         -nmotifs 20 -mod anr  -dna     > your_meme_output.htm &

Alignment accuracy assessment

MUMSA (WWW)

all alignments have to in aligned fasta format and contain the same sequences in the same order.

Paper: http://nar.oxfordjournals.org/cgi/content/full/33/22/7120

Alignment editors & beautifiers

Jalview (Java) [18]

good for sequence conversion (ALN, MSF, Fasta,Pfam, PIR, BLC)

clustalx (PC, Unix) [19]

ALN only

seaview (Linux) [20]

good for sequence conversion (ALN, MSF, Fasta, Mase, Phylip, Nexus)

genedoc (PC only) genedoc(PC only) very flexible color schemes

boxshade (no editing, pretty prints only)

ALN, MSF

TeXshade

LaTeX package. accepts ALN and MSF formats.

WAViS web server
- accepts multi FASTA with Clustal, NEXUS, GCG, Phylip conversion
- EPS SVG output

Sequence logos

MSA or matrix

enoLOGOS (first choice)

MSA

WebLogo (input as Fasta, ALN formats)

HMM

LogoMat-M http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-m.cgi

(single HMM profile)

LogoMat-P http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-p.cgi

(pair-wise HMM profiles)

DNA alignment

coding DNA alignment based on existing protein alignment

This is a method of constructing more accurate DNA alignment for downstream use in phylogenic analyses taking into account coding properties of given DNA sequences.

PAL2NAL webserver capable of dealing with stop codons7frame shifts, suitable also for pseudogenes

transAlign Perl script, requires Clustalw article
RevTrans
protal2dna

CodonAlign stand alone program

Genomic DNA against genomic DNA

MUMmer
- slides
PipMaker aligns two DNA sequences
MultiPipMaker (web server) article Aligns 2-30 DNA sequences

GATA paper

GATA: a graphic alignment tool for comparative sequence analysis

AuberGene stand alone Python scripts. Requires blastz for pairs comparison.
SeqAn::T-Coffee multiple sequence alignment program. SeqAn::T-Coffee aligns amino acid, DNA and RNA sequences using a consistency-based progressive alignment algorithm on a graph of sequence segments.

Non-coding DNA sequences

Sigma [21]

Designed specially for non-coding DNA sequence Paper: [22]

Similarity plots (dot-plots)

For the explanation how it works look here

dotter(Unix, PC)

Create dot-plots (DNA-DNA, protein-protein and DNA-protein) between one or more sequences. Ideal for quick estimation of sequence similarity, gene organization (genomic DNA vs cDNA), domain finding for small/medium size sequences. One can also zoom an area of interest. Memory and CPU is the limit but one can compare 200kbp genomic DNA vs 6kbp cDNA in ca 2min on 2GHz Pentium 4.

dotter first_fasta_DNA.seq second_fasta_DNA.seq
dotter first_fasta_AA.seq second_fasta_AA.seq
dotter first_fasta_DNA.seq second_fasta_AA.seq

Note that the order of DNA vs protein comparison matters. Try to keep sequence names within your multiple_fasta file short.

Scaling down image 10x with -z option:

dotter first_fasta_DNA.seq second_fasta_DNA.seq -z 10

Detailed description how to add annotation to dotplots: [23]

The example format of sequence features file (passed with -f option to dotter)

100	@1	CDS	1	1124	RED	gene1
100	@1	CDS	3360	3965	RED	gene2
100	@1	CDS	3360	3968	RED	gene3

dottup info (Emboss, Unix/Web) web form

Gepard bacterial genome scale fast dotplots (Java)

Database searches (sequence similarity)

There is a tradeoff between finding all possible matches (like with Smith-Waterman) and speed. For most applications Blast (and Psi-Blast) is a first choice. For massive searches, where speed is essential (i.e. mapping all cDNAs to a mammalian genome) use blat

Regular expressions

Searches for short motifs, like [MC]-A-x-E(2)-{IE} (M or C followed by A followed by any followed by EE followed by anything but I or E)

fuzznuc (from Emboss)

search nucleotide searches

fuzznuc -sequence your_sequence_for_searching.fasta -pattern 'GATTACCA' -mismatch number_of_mismatches_allowed -outfile output_file.txt

fuzzpro(from Emboss)

search protein sequences

fuzzpro -sequence your_sequence_for_searching.fasta -pattern 'MEGGADE' -mismatch number_of_mismatches_allowed -outfile output_file.txt

dreg (from Emboss)-

Regular expression search of a nucleotide sequence

dreg -sequence your_sequence_for_searching.fasta -pattern '[CG](5)TG{A}N(1,5)C' -mismatch number_of_mismatches_allowed -outfile output_file.txt

preg (from Emboss)

Regular expression search of a protein sequence

preg -sequence your_sequence_for_searching.fasta -pattern '[MARE](2)YDY{AGWK}' -mismatch number_of_mismatches_allowed -outfile output_file.txt

For more help about options/switches in Emboss programs type:

program_name -help -verbose

ScanProsite [24]

[CG](5)TG{A}N(1,5)C

corresponds to:

[C or G] (five times), T, G, {anything except A},1-5 of any, C

More about pattern syntax: [25]

First tier algorithms

Blast [26]

For the detailed BLAST tutorial see Wikiomics:BLAST_tutorial

Warning: There is a major change of blast standalone binaries. Info below is describing blastall based programs. While most of the options will work with new blast+ , check the information here

Command line: check first if $BLASTDB and $BLASTMAT are set (it should display these names sans $ signs & pathways):

env | grep BLAST

If you get a blank and these are not set in your '~/.ncbirc' file (tcsh):

setenv $BLASTDB /path/to/directory/with/blast/databases/
setenv $BLASTMAT /path/to/your/director/blast/data

or follow instructions for creating '~/.ncbirc' from here: [27]

to create a database out of your fasta file:

formatdb -i your_multiple_protein_file.fa -p T -V
formatdb -i your_multiple_DNAseq_file.fa  -p F -V

For more options type 'formatdb -h'

- nucleotide search

blastall -p blastn -d ecoli_nn -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastn_out

- protein search

blastall -p blastp -d ecoli_aa -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastp_out

Blast2 [28]

finding similarity between two sequences/alignment + dotplot on the web f

comparing protein sequence with DNA:

bl2seq -p tblastn -F N -i your_peptide_sequence.fasta -j your_nucleotide_sequence.fasta

Blat
- description: http://www.genome.org/cgi/content/full/12/4/656

- web server: http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
- command line

blat /path/to/database your_query_in_fasta.fa [-ooc=11.ooc] output.psl

-ooc=11.ooc option speeds up search by using a set of over-occuring 11-mers to create your database out of DNA sequence file:

faToTwoBit input_fasta.fa [in2.fa in3.fa ...] output.2bit

use output.2bit as a database in your blat searches.

Slower but more sensitive

FastA [29]
Smith-Waterman/MPsrch [30]
- only Protien queries. Emboss water [31] can be used to search local nucleotide databases .
Scanps (SW variant) [32]

HMM and profile searches

PSI-BLAST

HMM searches

- hmmer [33]

hmmbuild your_ALIGNED_sequences.hmm your_ALIGNED_sequences.fasta
hmmcalibrate your_ALIGNED_sequences.hmm
hmmsearch your_ALIGNED_sequences.hmm protein_database.fasta

in HMMER3 you have to specify the format, i.e.

/hmmbuild --amino --informat afa --cpu 8 your_ALIGNED_sequences.hmm your_ALIGNED_sequences.fasta

- MEME MAST [ http://meme.sdsc.edu/]

Command line:

mast your_meme_motifs.meme -d your_database.multiple_fasta

Data mining

finding basic info about known gene

EntrezGene [34]
Omim [35]
GeneCards [36]
HomoloGene [37]

Phylogeny

$$$ very preliminary at the moment $$$

select a set of sequences
create alignment
remove bad regions of the alignment/gaps (tree constructing programs deal badly with indels)
for more than 12 sequences tha space of all possible tree is to large to search one-by-one -> heuristic methods are used

Methods

maximum parsimony MP

sensitive to the order of sequences.

distance methods
- Neighbor joining NJ
- UPGMA (Unweighted Pair Group Method with Arithmetic mean) assumes rate of mutation constant along branches of a tree. Less accurate for sequence data than NJ.

maximum likehood (computationally intense, but more accurate)

To check validity of a tree:

bootstrap

After creating a tree from MSA bootstrapping program selects number of columns from MSA and calculates a tree. This process is repeated 100-1000 times and resulting trees are compared with previously created one. Branches present in more than 70% of bootstrap samples are believed to have a good support. These estimates are conservative

multiple samples from the posterior distribution (MrBayes)

Probably overestimate confidence values [38]

Programs

PHYLYP (phylogeny standard package) see a guide to it: [39]

Individual programs can be scripted. On unix (bash):

- create a new directory
- put your_aligned_proteins.phylip there
- start any text editor, modify content below to suit your options, save as i.e. run_proml_with_my_options.sh

#!/usr/bin/sh
proml << EOF
your_aligned_proteins.phylip
1 
S
y
EOF

- run it

source run_proml_with_my_options.sh

- after completein the run it reports:

Output written to file "outfile"
Tree also written onto file "outtree"

PHYML http://atgc.lirmm.fr/phyml/ (ML)
MrBayes http://mrbayes.csit.fsu.edu/ (Bayes / Markov chain Monte Carlo.)

 requires NEXUS format. To tun it on command line: 'mb'

RAxML-VI http://www.ics.forth.gr/~stamatak/index-Dateien/Page443.htm (ML for large phylogenetic trees)
TREE-PUZZLE http://www.tree-puzzle.de/ (ML)
Mesquite collection of modules for performing a wide range of analyses from various branches of evolutionary biology (e.g., phylogenetics, molecular evolution, population genetics, geometric morphometrics)

Molecular clock

BEAST (Linux, Win, Mac) Bayesian MCMC analysis of molecular sequences

Tree display

For the comprehensive list see: http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software

PhyloWidget novel (2008) program (standalone + web java widget)

References

extensive list of programs: http://evolution.genetics.washington.edu/phylip/software.html
tutorial articles:
- Phylogeny for the faint of heart: a tutorial. Sandra L. Baldauf.

- http://scholar.google.com/url?sa=U&q=http://www-edlab.cs.umass.edu/cs691k/blanchard/HolderLewis2003.pdf

Genomic structure

One gene in one species

Almost 20 species have as for today (May 2006) separate entries in Ensembl. Possible searches include: blast, GeneBank accession number, gene symbol, chromosomal band, marker etc. To elucidate genomic organization (exons/introns) of a gene starting from a known gene name/accession number.

Ensembl (known transcript)

go to Ensembl
click on a link to an organism of interest say Human (Homo sapiens)
on the top of the page, "Search for:" (select "Gene" from pull down menu)
put the gene symbol (i.e. p53), Search
select from the list of hits Ensembl gene (click on the red link)
you are in "Ensembl Gene Report"
right to "Sequence Markup" click on " View genomic sequence for this gene with exons highlighted"
the sequence (including SNPs) is displayed

Spidey (novel transcript)

get cDNA sequence in FASTA format
check it for repetitive sequences using RepeatMasker
get genomic sequence containing the gene with 1-10kb flanks
check the gene structure using dotter this is to detect cDNA/genomic DNA miss-assembly/gaps etc.
go to http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/spideyweb.cgi
fill in the form selecting the proper splice site matrix (so far: vertebrates, Drosophila, C. elegans, or plants)
you get a table with genomic coordinates/mRNA coordinates, exon length, % of identity, number ofmismatches, gaps, precence or abcence of splice donor and acceptor sites

Very good at finding exons in the same species (i.e human cDNA/human genomic sequence), more problematic with cross-species comparisons (i.e. human cDNA, bovine genomic sequence -> most of the exons OK, with few bad predictions).

exalin http://blast.wustl.edu/exalin/

comand line only. Slow but good at finding small exons. So far has matrices for human (default), mouse, rice and C.elegans.

exalin your_genomic_sequence.fa  your_cDNA_sequence.fa

wise2 [40]

Best for: comparing a protein sequence or hmm profile to a genomic/cDNA sequence from other organism Restrictions: fasta headers longer than 15 characters are truncated

One needs to set up env variable for wise, i.e bash:

export WISECONFIGDIR=/usr/biosoft/packages/wise/wise2.2.3-rc7//wisecfg/

- - peptide vs genomic DNA:

genewise your_peptide_sequence.single_fasta  your_genomic_sequence.single_fasta

- - HMM profile vs genomic DNA:

genewise -hmmer your_hmm_profile.HMM your_genomic_sequence.single_fasta

- - peptide vs cDNA/EST

estwise your_peptide_sequence.single_fasta your_cDNA_sequence.single_fasta

- - HMM profile vs cDNA/EST

estwise -hmmer  your_hmm_profile.HMM your_cDNA_sequence.single_fasta

GeneSeqer WWW server + stand alone program. Multiple species models.

Pairagon [41]

standalone cDNA-to-genome alignment program

PALMA 2007 program stand alone claimed to outperform other algorithms. Parameter files for human, C.elegans, arabidopsis and drosophila (Nov 2007)

One gene in multiple species

SVC

Structured Visualization of Evolutionary Conserved Sequences. Provides colorful graphs depicting intron exon orientation with sizes.

go to SVC
put gene symbol (i.e. SYNGR3) or better Ensembl ID (i.e. ENSG00000127561) into 'Input ID'
check species of interest in 'Search for orthologs in:'
Click 'Submit' next to 'Input ID'

Bacterial genomics

SEED precomputed similarities, annotations using colocalization in operons. Very efective for annotation new genes.

Integr8 Ensembl-like interface to bacterial genomes

WebACT WebACT genome-2-genome comparisons

Promoter prediction

Predicting promoters based on a single sequence is rather unaccurate. Transcription factor (TF) recognition sequences are fairly short, therefore any sequence 1kbp in lenght will contain hundreds of such motifs, only a handful of which will be a true positives. There are two major methods used for finding relevant sites

phylogenetic fingerprinting (comparison of promoter regions between orthologues)
comparison of promoter regions of genes with similar expression profiles in the same species

We can look for either known TFs recognition sequences (quality of which depends on database used) or perform motif finding de novo, looking simply for any streach of DNA shared between species/co-expressed genes.

before you start

make sure that you got a real 5' end exon of the gene (a lot of longer genes in the databases may be missing the true 5'end). You verify quality of your gene of interest by:

- comparing with orthologues ( Ensembl or Homologene) If close orhologues are longer on the protein level, look for the presumptive missing exon(s).
- looking at the genomic context (other genes flanking yours)
- blasting cDNAs 5' end against ESTs database looking for any ESTs extending your gene
- blasting masked genomic sequence, from 5' end exon to the start/end of the flanking gene

Interpretation of this one may be tricky, on the simplest level look for spliced matches with ORFs in the same orientation as your gene without clear stop codons.

find position of repeats

After checking the 5' end of the gene, always look for repetitive sequences using RepeatMasker. It does not mean that say Alu/LINE1 can not be a part of the promoter. Just that you may concentrate on parts which are a bit more specific for the given gene.

phylogenetic fingerprinting

comparisons of promoter regions of orthologous genes in other species

obtain these using Ensembl [42]
use tools described below to find out shared TF binding sites/comon motifs

Caveats:

comparing closely related genomic sequences (human and chimp) is not stringent enough
comparing wildely different sequences (human and zebrafish) is too stringent for all but small set of genes (i.e. homeobox?)

comparing promoters of coexpressed genes in the same species

use GEO (Gene Expression Omnibus) [43] to find gene(s) with simmilar expression
get promoter sequences using Ensembl [44]

Tools:

Shared TF binding sites

Quality depends on the underlying database of TF binding sites

Consite produces a set of transcription factors binding sites shared between two sequences
zpicture plus rVISTA
Conreal (multiple alignment algorithms) [45]
Confac [46]
Mulan multiple sequences
Footprinter [47]

Shared DNA motifs only

For more in depth coverage see: Wikiomics:Sequence_motifs

This method will work even for unknown TF binding sites, but it may skip some more degenerate motifs.

MEME capable of discovering longer motifs, very sensitive
Improbizer takes into account position of the motifs
AlignACE Gibbs sampling algorithm
MDscan

It is important to use few complementary programs as well as allow predictions of several motifs per program. There is threshold when it comes to number of sequences used, above which there is no improvement in sensitivity.

Comparison of several programs: http://nar.oxfordjournals.org/cgi/content/full/33/15/4899#TBL1

Others:

Promoter Databases

Eukaryotic Promoter Database
Mammalian Promoter Database MPromDb
DoOP eukaryotic promoter sequences
DBTSS Database of Transcriptional Start Sites

Transcription Factiors Databases

JASPAR (open transcription factors database)
TESS(finding transcription factor binding sites within your sequence)

Be aware that simply listing all recognizable TF binding sites in a given stretch of DNA is almost of no value. Only 0.1% of the putative sites listed that way will be functional (Applied bioinformatics for the identification of regulatory elements WW Wasserman, A Sandelin - Nature Reviews Genetics, 2004)

Reviews & tutorials

gene regulation tools site very comprehensive

Other tutorial (you need an access to Nature Immunology): [48]
PLOS Apr 2006 MacIsaac and Fraenkel article [49]

For an extensive table of known programs used for finding motifs see [50]

Transcription start finding

There are various clases of such programs starting with completa ab initio promoter finding using genomic sequence only, through combining it with ab initio gene fining, finally taking into account ESTs data and/or gene annotations. Most of the programs were evaluated on human or mammalian sequences and may not be suitable for other groups of organisms. Masking repeats using RepeatMasker improves performance of McPromoter(strongly) and FirstEF (mildly), whereas has no effect on Eponine and DragonGSF.

Programs

N-SCAN (standalone only derivative of TwinSCAN) [51]
Eponine (mammalian) [52]
FirstEF (human) [53]
DragonGSF [54]

McPromoter (fly, with older version for human) [55]

PromH (promoters identification using orthologous genomic sequences) [56]

ARTS (superior but not as Jan 2007 publicly available) [57]
CoreBoost

Newest:

Databases

EPD eukaryotic promoter database
DoOP eukaryotic promoter sequences

VISTA Enhancer Browser Human/vertebrates enhancers DB

Reviews

Bajic et al. Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment

Genome Biology August 2006 [58]

Bajic VB, Tan SL, Suzuki Y, Sugano S: Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004, [59]

SNPs (Single Nucleotide Polymorphism)

NCBI SNP database http://www.ncbi.nlm.nih.gov/SNP/ (strongly recommended)
HGVbase Phenotype/Genotype database (funding expired?)
JSNP http://snp.ims.u-tokyo.ac.jp/ (some access problems)
CSHL http://snp.cshl.org/

Gene expression

Unigene

go to Unigene (ESTs derived from the gene)
search Unigene with gene symbol (i.e SYNGR1) or Unigene cluster (i.e. SYNGR1)
look at Expression and distribution of ESTs [60]
click on Expression profile to see electronic Northern

Symatlas

Wide array of tissues and genes.

go to Symatlas
put your gene symbol, i.e. SYNGR1
click on a small icon left to human or mouse genes

in situ hybridisation images

EMAGE

http://genex.hgu.mrc.ac.uk/Emage/database/emageIntro.html

Gensat

Atlas of in situ hybridization of mouse brain

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gensat

Genepaint

Atlas of in situ hybridization in mouse embryos. One can order hybridization of a gene of interest.

http://www.genepaint.org/Frameset.html

Digital Differential Display

http://www.ncbi.nlm.nih.gov/UniGene/info_ddd.shtml

Searching info about gene homologues in other databases

Mouse [61]
C.elegans & co. [62]
D. melanogaster [63]
S. cerevisiae SGD@stanford [64]
Proteome (multiple organisms) [65] ($$$) (databases of S. cerevisiae, S.pombe and WormPB [C.elegans+]

Finding protein domains

Finding protein motifs

Phosphorylation

GPI-Anchors

http://mendel.imp.ac.at/gpi/gpi_server.html

Disulfide bond predictions

DISULFIND Newest

Sequence/structure retrieval

Entrez
Batch Entrez (retrieving multiple DNA/protein entries at once):
SRS (complex queries )
PROMPT Protein Mapping and Comparison Tool

Protein localization and structure prediction

collection of links: [66]

TarO queries multiple servers

localization

Start with:

Evaluating eukaryotic secreted protein prediction [67]

Localizome server Localizome combines prediction of Phobius, Pfam, LocaloDom (database of TM topologies and TM helix numbers of around 3,000 eukaryotic Pfam domains)
- The maximum number of input proteins in a single submission is 500 protein.
- The length of each sequence is limited to 6,000 residues.

Standard choices

In evaluation (localization):=

MultiLoc
LOCSVMPSI (localization, results by email) site down 2010-03-11
BaCelLo (max five fasta sequences)
ProteinProwler (plant/nonplant and four localizations only)

transmembrane

Recent review: Punta, Marco, Lucy R. Forrest, Henry Bigelow, Andrew Kernytsky, Jinfeng Liu, and Burkhard Rost. “Membrane protein prediction methods.” Methods 41, no. 4 (April 2007): 460-474.

top servers were evaluated by Cuthbertson et al.( Protein Engineering Design and Selection 2005) full text:
- SPLIT4 (first choice)
- TMHMM-2.0
- HMMTOP 2
- TopPred

Other servers:
- ConPred2 Consensus TM prediction (metaserver)
- DAS-TM
- Phobius (includes signal peptide predictions)
- PRODIV-TMHMM ref: [68] (results send by email, prediction of TM topology)
- MemBrain (includes signal peptide predictions) claims to outperform several other top scoring TM predicting programs paper(HTML)

see also links here: [69] and the Signal Peptide section

signal peptides

SignalP
Phobius (predicts signal peptide, TM and orientation)
PolyPhobius Homology supported signal peptide predictions

Accepts both single fasts seqence as well as aligned sequences in multiple fasta format (only the top sequence appears in the output).

Phobius author PhD thesis: [70]

Other Programs:

SPEPlip Eukaryots/Prokaryots
TatP Twin-arginine signal peptide cleavage sites in bacteria Paper: [71]
LipoP lipoprotein signal peptides in Gram-negative bacteria. Paper: [72]

protein structure (no-3D)

Expasy [73]
Coils (coiled coil prediction) [74]
Protscale (various prediction based on AA sequence) [75]
Jpred (multiple secondary structure predictions run in parallel) [76]

Graphical plots of protein 2D structure

TMRPres2D standalone Java program. Top choice:

' 'One-step' data input procedure, including easy querying of the public databases and incorporation of output from various popular prediction tools available from the Web.' allows creating graphs with transmembrane regions, signal peptides, disulfide-bridhes, glicosylated residues etc.

TOPO2 web based

Protein structure databases/comparisons

Metabolic networks

$$ needs more work$$

Reactome
KEGG
BioCyc
Cyclone - provides an open source Java API for easier access to BioCyc.
RegulonDB E.coli K12 DB (operons/genes/regulatory elements)

Protein-protein interactions

Large scale methods such as yeast-two-hybrids (Y2H), immuno-co-precipiatation (IP) produce a large number of false positives. Only about 30-50% of interactions seem to be relevant. While newer improvements of these procedures tend to be more accurate, existing databases struggle to separate good quality data.