Wikiomics:Bioinfo tutorial

From OpenWetWare
Jump to navigationJump to search

Wikiomics:Bioinfo tutorial (exercises)


The goal of this document is to provide a set of illustrative cookbook recipes for wet lab biologists. As there is more than one way of performing a given task, it does point to different tools, but it does not even try to become a yet another directory of links.

You are strongly encouraged to post your comments/ideas on the discussion page of this tutorial.

Before you start

  • use text files/plain e-mail whenever possible
  • give meaningful names to your files (i.e. humP53_XP12345.fas)
  • do not use spaces or Unix special characters in file names (!?"'%&^~*$|/\{}[]()<>:)
  • create separate folders/directories for each project with meaningful names
  • do not waste paper for printing out your blast search results or sequences
  • if primary data needs to be archived burn it on CD using (if possible) non vendor-specific data format

Operating systems/General software

For installing standalone bioinformatics programs the order of preference is: Linux/Unix > MacOS > Windows

Once you move from worksation to bioinformatics application server (almost always Linux/Unix) then as long as you have a Xwindows connection (preferably ssh tunneled) operating system of client workstation plays a minor (if any) role.

DNA sequence processing

Sequence formats

There are multiple sequence formats commonly used. These can be divided into text-only and binary formats.


Unaligned sequences
  • Raw sequence ( example) The text file contains nothing but sequence. The information that's inside has to be encoded in the name of the file. Some web servers still require it.
  • FASTA (description, example) The most commonly used format (by programs/servers). It can contain either single or multiple sequences, each having a name line, also known as the header, starting with ">" in the first column. Keep in mind that various programs/servers introduce extra restrictions how the FastA header should look like. The "safe" standard:
    • Alphanumeric characters only [a-z, A-Z, 0-9], avoid spaces!
    • Keep the names short [i.e. 30 characters are significant for clustalw, only 8 for older versions of t_coffee]
    • In multiple FastA format keep the names unique

Whenever you plan to analyze your sequence using web servers or command line programs, keep it as FASTA. It is easy to cut the raw sequence out of it, and if your file name/sequence names contain accession numbers to find the GenBank entry.

  • GenBank format of choice used by GenBank to display extra information about the sequence. It is fairly complex, but easily readable by humans. Keep your sequences as GenBank when you need extra annotations provided by this format.
  • GCG GCG package own format. If you need to use GCG, you will have to convert your sequences into it
  • XML eXtended Markeup Language. Standard language slowly taking over the Net. XML files are computer-readable (= information is fairly easy to extract from them) and possible to read by humans. Format of choice when complex sequence information need to be described and computer accessible.
  • GFF format for (mostly) genomic sequence annotations used by Ensembl [1]
Aligned sequences
  • MSF Multiple Sequence Format. Used as output for multiple sequence alignments (pileup from GCG) and as input for alignment editors and shadowing programs.
  • ALN standard for multiple sequence alignments. Output format of clustalw and t_coffee.
  • PHYLIP used as an input in such programs like PHYLIP package, phyml
  • NEXUS advanced format allowing more annotation, used by MrBayes


  • ABI binary output format of ABI sequencers. Bulky (= unnecessary large), but so far the only one readable on Mac-OS. Can be easily converted into SCF.
  • SCF Sequence Common Format. Standard output for ALF sequencers. More compact than ABI, and commonly used in sequencing projects. Readable on PC and Unix, binary format of the Staden package.
  • ZTR successor of SCF with internal data compression. Currently the default binary format for Staden.

Sequence conversion tools

OK for simple format, has problems with aligned sequences

A function in Sean Eddy's SQUID package for converting between unaligned and aligned sequence formats.

can be used for sequence conversion, also for many other bioinformatic related functions.

  • for convertions between aligned formats (ALN, MSF, NEXUS, PHYLIP) use:
  1. SeaView GUI program
  2. ToPhylip (WWW) ALN and MSF -> Phylip

Processing ABI files and sequence assembly

Before you even start sequencing, plan the way the sequences will be processed. The most important is to establish a proper naming scheme compatible with various programs you (or others) may use downstream for sequence processing. The basic rules for file names are:

  • use alphanumeric characters [a-z, A-Z, 0-9]
  • use underscore "_" instead of spaces
  • full stop shall separate information about sequencing template and primer, like:

Even small sized projects become cumbersome if file names are not unique (you can not assembly files having the same name!). Once these requirements are satisfied, file name alone will give you enough information about:

  • whose sample is it?
  • to which sequencing project does it belong?
  • what was the sequencing template?
  • which primer was used?
  • was this sample sequenced before?

Programs for DNA sequence assembly

Fasta files (no traces)

  • phrap
phrap your_multiple_fasta_file.fa 

Results in: your_multiple_fasta_file.fa.contigs

Used for i.e assembly of ESTs.

ABI/SCF files (traces)


see also another Wikiomics page Wikiomics:Staden

    • pregap4 (sequence pre-processing)

Takes a directory or a list of files for preprocessing (creating EXP files, clipping bad quality sequence, marking sequencing vector, repeats etc.

Creating sequence out of ABI chromatogram file is done by a basecalling program. One can go along with ABI-provided basecaller or preferably use:

  • phred [2] giving more accurate calls for less accurate part of the sequence (like at the end of the run, say 600bp and more) . Phred also gives a probability/quality values for each of the bases allowing more accurate assembly.
pregap4 %
    • gap4 (sequence assembly, editing, verification and much more)
gap4 &

or (to open a specific database at startup):

gap4 database.version.aux &

Other packages:

Vector masking

  • EMVEC (web and blast2 based) [4]

Good for a single sequence.

  • vectorstrip (Emboss)
vectorstrip -sequence  my_fasta_file -vectors my_vectors_fasta_file -mismatch 30
5' sequence
3' sequence
  • cross_match (from phrap package)
cross_match my_fasta_file my_vectors_fasta_file -minmatch 10 -minscore 20 -screen > screen.out

You will get a file: my_fasta_file.screen --- "devectorised" my_fasta_file

For masking out, let's say, one gene/overlapping clone, etc., type:

cross_match fasta_file_to_be_masked masking_sequence -minmatch 10 -minscore 20 -screen > screen.out

You will get a file: fasta_file_to_be_masked.screen without the sequences matching masking_sequence.

  • Figaro

A novel approach for finding putative vector/adapter sequences without knowledge of the vector sequence. "vector sequence is automatically inferred by analyzing the frequency of occurrence of short oligo-nucleotides using Poisson statistics"

Masking repetitive sequences using RepeatMasker (WWW/Unix)

  • Online web server [5]
  • command line

You have to have a FastA file (it can be multiple FastA). Type:

repmask your_sequence_in_fasta_format

You will get a file: your_sequence_in_fasta_format.masked --- name tells all

species options (choose only one):

-m(us) masks rodent specific and mammalian wide repeats
-rod(ent) same as -mus
-mam(mal) masks repeats found in non-primate, non-rodent mammals
-ar(abidopsis) masks repeats found in Arabidopsis
-dr(osophila) masks repeats found in Drosophilas
-el(egans) masks repeats found in C. elegans

Primer design

General info about primer design/PCR:


  • Pythia new software claiming superiority over primer3
  • Oligo (PC & Mac ) $$$, first choice for small projects

Also search with i.e. "primer pcr" for another, more task specific programs.

  • Primer Select from Lasergene (PC & Mac)

DNA -> protein translation

Cloning in silico

ESTs based

Pre-clustered ESTs:

Genome based

  • based on homology
  • de novo

see also separate page: Wikiomics:Cloning_in_silico

Genome annotation using ESTs assembly

  • PASA tool used in many current genome annotation projects

Synthetic genes

Many genes due to an extreme codon bias can not be expressed in model organisms, such as E.coli. To fix this problem one can back-translate protein sequence into nucleotides, using codons preferably used by E.coli. Moreover one can construct sucgh gene from a set of overlapping oligos, introducing restriction sites or other features at will. Programs performing such tasks:

  • Codon preference in Pichia pastoris

  • Ref
Applied Microbiology and Biotechnology
Issue:  Volume 67, Number 3
Date:  May 2005
Pages: 289 - 298
Strategies for efficient production of heterologous proteins in Escherichia coli
S. Jana1 and J. K. Deb

Codon usage

There is a bias in codon usage resulting in variation in translation speed/efficiency. The bias itself varies between organisms (i.e. it is strong in yeast, then in E.coli, much weaker in B.subtilis).

One can calculate a number of codon bias indexes using programs:

  • E-CAI HTML "The E-CAI server provides a direct threshold value for discerning whether the differences in CAI are statistically significant or whether they are merely artifacts that arise from internal biases in the G+C composition and/or amino acid composition of the query sequences."


The Vienna RNA package has solutions for many classical RNA problems. It well worth taking some time to familiarize oneself with the algorithms in this package before checking out the alternative methods.

Secondary structure prediction

For Wikiomics:RNA secondary structure prediction from a single sequence the most commonly used tools are MFold and RNAfold. For more single sequence and other comparative methods a large range of methods are available.

For the best accuracy (and evolutionary information) comparative approaches, in particular the alignment folding methods, have proven rather successful. Typically a researcher will align sequences using tools such as ClustalW, MAFFT and ProAlign and then fold the alignment using tools such as RNAalifold or Pfold.

RNA homology search

For the Wikiomics:RNA homology search problem (including secondary structure) a number of algorithms exist. E.g. Infernal and RaveNnA. A number of tools specializing in a specific RNA family are also available eg. tRNA, miRNA etc.

RNA gene-finding

Tools for Wikiomics:RNA gene-finding usually use as input sequence alignmments which are then scanned for potential conserved secondary structure signals. Useful examples of these are QRNA, RNAz and EvoFold.

RNA databases

A number of Wikiomics:RNA databases are available.

  • microRNA database (sequences, targets, sumbmission for publications)
  • Rfam a large collection of multiple sequence alignments and covariance models covering many common non-coding RNA families.
  • BRaliBase Benchmarks of RNA analysis tools on trusted ncRNA datasets.

Protein Alignment

See also article in wikipedia: Sequence alignment software

Two sequences

There are two kinds of alignment: local and global.

finds the best region of similarity between two sequences
  • water (Emboss) uses the Smith-Waterman algorithm (modified for speed) web interface
  • lalign Huang and Miller algorithm [10]
  • blast2 [11]
  • bestfit (GCG) Smith-Waterman algorithm

covers the whole length of both sequences
  • needle (Emboss) Needleman-Wunsch global alignment [12]
  • gap (GCG) Needleman-Wunsch global alignment

Multiple sequences

For a recent review see Notredame 2007 @PLOS

There is a trade off between accuracy and speed. For speed select MUSCLE. Accuracy of algorithms differs, with the best results on average being achieved using PRALINE and t-coffee/M-coffee. For poteins with multiple/repeated domains you may try graph based algorithms: poa or aba.

  • t_coffee (good at producing alignments of distant (~30% similarity) sequences. [13]
t_coffee your_fasta_file.fa

T-Coffee has a limit of 50 sequences. The new version is called M-Coffee, it is a meta-aligner using output of several other algorithms (ClustalW, Poa, Muscle, ProbCons, MAFFT, Dialign-T, PCMA). To run it (bash):

export POA_DIR=/absolute/path/to/Poa/installation/directory
export DIALIGNT_DIR=/absolute/path/to/DIALIGN-T/conf
t_coffee your_fasta_file.fa -special_mode mcoffee
  • MAFFT one of the top alignment programs, with multiple options. See benchmarks from 2006 papers here
  • PRALINE (slow but takes into account blast results) [14]

Uses PSI-BLAST for finding most conserved residues in each of the sequences. Very slow but could be more accurate as compared with other methods.

  • clustalw (standard algorithm most commonly used and "good enough" in most cases) [15]
clustalw your_fasta_file.fa

output as aln format

  • dialign (good for creating alignments with long gaps as with non-orthologues sharing domains only) [16]

Improved, command line version is DIALIGN-T [17]

dialign-t ./path/to/dialign-t/conf/ your_fasta_file.fa your_fasta_file.dialign-t

output is in fasta format

muscle -clw -in your_fasta_file.fa -out your_fasta_file.aln -clw
probcons -clustalw -ir 100 -pre 20 -a your_sequence_in_fasta_format.fa > probcons.aln

Graph based algorithm replacing columns with directed acyclic graph. Permits domain recombinations of domains/exons: useful for the alignment of multidomain proteins and ESTs.

 poa -read_fasta your_fasta_file.fa  -clustal output_file.aln /path/to/similarity/matrix/i.e/blosum80.mat 

Method using A-Bruijn graph. Good for alignment of proteins with:

    • domains that are not present in all proteins
    • domains that are present in different orders in different proteins
    • domains that are present in multiple copies in some proteins

DNA: ABA is useful in the alignment of genomic sequences that contain duplications and inversions. Based on:

  • MEME (good for finding as-yet-unknown domains in a set of proteins) MEME web site
meme -nmotifs numberof_motifs_2_find -model oops  -protein -sequence your_fasta_file.fa.fa -outfile

Where 'model' could be oops/zoops/anr.

oops: One Occurrence Per Sequence

zoops: Zero or One Occurrence Per Sequence

anr: Any Number of Repetitions

MEME syntax (see more at

meme your_input_fasta_file.fas         -nmotifs 20 -mod anr  -dna     > your_meme_output.htm &

Alignment accuracy assessment

all alignments have to in aligned fasta format and contain the same sequences in the same order.


Alignment editors & beautifiers

good for sequence conversion (ALN, MSF, Fasta,Pfam, PIR, BLC)

  • clustalx (PC, Unix) [19]

ALN only

  • seaview (Linux) [20]

good for sequence conversion (ALN, MSF, Fasta, Mase, Phylip, Nexus)

  • genedoc (PC only) genedoc(PC only) very flexible color schemes
  • boxshade (no editing, pretty prints only)


LaTeX package. accepts ALN and MSF formats.

  • WAViS web server
    • accepts multi FASTA with Clustal, NEXUS, GCG, Phylip conversion
    • EPS SVG output

Sequence logos

MSA or matrix


  • WebLogo (input as Fasta, ALN formats)


(single HMM profile)

(pair-wise HMM profiles)

DNA alignment

coding DNA alignment based on existing protein alignment

This is a method of constructing more accurate DNA alignment for downstream use in phylogenic analyses taking into account coding properties of given DNA sequences.

  • PAL2NAL webserver capable of dealing with stop codons7frame shifts, suitable also for pseudogenes

Genomic DNA against genomic DNA

GATA: a graphic alignment tool for comparative sequence analysis

  • AuberGene stand alone Python scripts. Requires blastz for pairs comparison.
  • SeqAn::T-Coffee multiple sequence alignment program. SeqAn::T-Coffee aligns amino acid, DNA and RNA sequences using a consistency-based progressive alignment algorithm on a graph of sequence segments.

Non-coding DNA sequences

Designed specially for non-coding DNA sequence Paper: [22]

Similarity plots (dot-plots)

For the explanation how it works look here

Create dot-plots (DNA-DNA, protein-protein and DNA-protein) between one or more sequences. Ideal for quick estimation of sequence similarity, gene organization (genomic DNA vs cDNA), domain finding for small/medium size sequences. One can also zoom an area of interest. Memory and CPU is the limit but one can compare 200kbp genomic DNA vs 6kbp cDNA in ca 2min on 2GHz Pentium 4.

dotter first_fasta_DNA.seq second_fasta_DNA.seq
dotter first_fasta_AA.seq second_fasta_AA.seq
dotter first_fasta_DNA.seq second_fasta_AA.seq

Note that the order of DNA vs protein comparison matters. Try to keep sequence names within your multiple_fasta file short.

Scaling down image 10x with -z option:

dotter first_fasta_DNA.seq second_fasta_DNA.seq -z 10 

Detailed description how to add annotation to dotplots: [23]

The example format of sequence features file (passed with -f option to dotter)

100	@1	CDS	1	1124	RED	gene1
100	@1	CDS	3360	3965	RED	gene2
100	@1	CDS	3360	3968	RED	gene3
  • Gepard bacterial genome scale fast dotplots (Java)

Database searches (sequence similarity)

There is a tradeoff between finding all possible matches (like with Smith-Waterman) and speed. For most applications Blast (and Psi-Blast) is a first choice. For massive searches, where speed is essential (i.e. mapping all cDNAs to a mammalian genome) use blat

Regular expressions

Searches for short motifs, like [MC]-A-x-E(2)-{IE} (M or C followed by A followed by any followed by EE followed by anything but I or E)

  • fuzznuc (from Emboss)

search nucleotide searches

fuzznuc -sequence your_sequence_for_searching.fasta -pattern 'GATTACCA' -mismatch number_of_mismatches_allowed -outfile output_file.txt
  • fuzzpro(from Emboss)

search protein sequences

fuzzpro -sequence your_sequence_for_searching.fasta -pattern 'MEGGADE' -mismatch number_of_mismatches_allowed -outfile output_file.txt
  • dreg (from Emboss)-

Regular expression search of a nucleotide sequence

dreg -sequence your_sequence_for_searching.fasta -pattern '[CG](5)TG{A}N(1,5)C' -mismatch number_of_mismatches_allowed -outfile output_file.txt
  • preg (from Emboss)

Regular expression search of a protein sequence

preg -sequence your_sequence_for_searching.fasta -pattern '[MARE](2)YDY{AGWK}' -mismatch number_of_mismatches_allowed -outfile output_file.txt

For more help about options/switches in Emboss programs type:

program_name -help -verbose


corresponds to:

[C or G] (five times), T, G, {anything except A},1-5 of any, C

More about pattern syntax: [25]

First tier algorithms

For the detailed BLAST tutorial see Wikiomics:BLAST_tutorial

Warning: There is a major change of blast standalone binaries. Info below is describing blastall based programs. While most of the options will work with new blast+ , check the information here

Command line: check first if $BLASTDB and $BLASTMAT are set (it should display these names sans $ signs & pathways):

env | grep BLAST

If you get a blank and these are not set in your '~/.ncbirc' file (tcsh):

setenv $BLASTDB /path/to/directory/with/blast/databases/
setenv $BLASTMAT /path/to/your/director/blast/data

or follow instructions for creating '~/.ncbirc' from here: [27]

to create a database out of your fasta file:

formatdb -i your_multiple_protein_file.fa -p T -V
formatdb -i your_multiple_DNAseq_file.fa  -p F -V

For more options type 'formatdb -h'

    • nucleotide search
blastall -p blastn -d ecoli_nn -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastn_out
    • protein search
blastall -p blastp -d ecoli_aa -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastp_out

finding similarity between two sequences/alignment + dotplot on the web f

comparing protein sequence with DNA:

bl2seq -p tblastn -F N -i your_peptide_sequence.fasta -j your_nucleotide_sequence.fasta
blat /path/to/database your_query_in_fasta.fa [-ooc=11.ooc] output.psl

-ooc=11.ooc option speeds up search by using a set of over-occuring 11-mers to create your database out of DNA sequence file:

faToTwoBit input_fasta.fa [in2.fa in3.fa ...] output.2bit

use output.2bit as a database in your blat searches.

Slower but more sensitive

  • FastA [29]
  • Smith-Waterman/MPsrch [30]
    • only Protien queries. Emboss water [31] can be used to search local nucleotide databases .
  • Scanps (SW variant) [32]

HMM and profile searches

  • HMM searches
hmmbuild your_ALIGNED_sequences.hmm your_ALIGNED_sequences.fasta
hmmcalibrate your_ALIGNED_sequences.hmm
hmmsearch your_ALIGNED_sequences.hmm protein_database.fasta

in HMMER3 you have to specify the format, i.e.

/hmmbuild --amino --informat afa --cpu 8 your_ALIGNED_sequences.hmm your_ALIGNED_sequences.fasta

Command line:

mast -d your_database.multiple_fasta

Data mining

finding basic info about known gene


$$$ very preliminary at the moment $$$

  1. select a set of sequences
  2. create alignment
  3. remove bad regions of the alignment/gaps (tree constructing programs deal badly with indels)
  4. for more than 12 sequences tha space of all possible tree is to large to search one-by-one -> heuristic methods are used


  • maximum parsimony MP

sensitive to the order of sequences.

  • distance methods
    • Neighbor joining NJ
    • UPGMA (Unweighted Pair Group Method with Arithmetic mean) assumes rate of mutation constant along branches of a tree. Less accurate for sequence data than NJ.
  • maximum likehood (computationally intense, but more accurate)

To check validity of a tree:

  • bootstrap

After creating a tree from MSA bootstrapping program selects number of columns from MSA and calculates a tree. This process is repeated 100-1000 times and resulting trees are compared with previously created one. Branches present in more than 70% of bootstrap samples are believed to have a good support. These estimates are conservative

  • multiple samples from the posterior distribution (MrBayes)

Probably overestimate confidence values [38]


  • PHYLYP (phylogeny standard package) see a guide to it: [39]

Individual programs can be scripted. On unix (bash):

    • create a new directory
    • put your_aligned_proteins.phylip there
    • start any text editor, modify content below to suit your options, save as i.e.
proml << EOF
    • run it
    • after completein the run it reports:
Output written to file "outfile"
Tree also written onto file "outtree"
 requires NEXUS format. To tun it on command line: 'mb'

Molecular clock

  • BEAST (Linux, Win, Mac) Bayesian MCMC analysis of molecular sequences

Tree display

For the comprehensive list see:

  • PhyloWidget novel (2008) program (standalone + web java widget)


Genomic structure

One gene in one species

Almost 20 species have as for today (May 2006) separate entries in Ensembl. Possible searches include: blast, GeneBank accession number, gene symbol, chromosomal band, marker etc. To elucidate genomic organization (exons/introns) of a gene starting from a known gene name/accession number.

  • Ensembl (known transcript)
  1. go to Ensembl
  2. click on a link to an organism of interest say Human (Homo sapiens)
  3. on the top of the page, "Search for:" (select "Gene" from pull down menu)
  4. put the gene symbol (i.e. p53), Search
  5. select from the list of hits Ensembl gene (click on the red link)
  6. you are in "Ensembl Gene Report"
  7. right to "Sequence Markup" click on " View genomic sequence for this gene with exons highlighted"
  8. the sequence (including SNPs) is displayed
  • Spidey (novel transcript)
  1. get cDNA sequence in FASTA format
  2. check it for repetitive sequences using RepeatMasker
  3. get genomic sequence containing the gene with 1-10kb flanks
  4. check the gene structure using dotter this is to detect cDNA/genomic DNA miss-assembly/gaps etc.
  5. go to
  6. fill in the form selecting the proper splice site matrix (so far: vertebrates, Drosophila, C. elegans, or plants)
  7. you get a table with genomic coordinates/mRNA coordinates, exon length, % of identity, number ofmismatches, gaps, precence or abcence of splice donor and acceptor sites

Very good at finding exons in the same species (i.e human cDNA/human genomic sequence), more problematic with cross-species comparisons (i.e. human cDNA, bovine genomic sequence -> most of the exons OK, with few bad predictions).

comand line only. Slow but good at finding small exons. So far has matrices for human (default), mouse, rice and C.elegans.

exalin your_genomic_sequence.fa  your_cDNA_sequence.fa

Best for: comparing a protein sequence or hmm profile to a genomic/cDNA sequence from other organism Restrictions: fasta headers longer than 15 characters are truncated

One needs to set up env variable for wise, i.e bash:

export WISECONFIGDIR=/usr/biosoft/packages/wise/wise2.2.3-rc7//wisecfg/
      • peptide vs genomic DNA:
genewise your_peptide_sequence.single_fasta  your_genomic_sequence.single_fasta
      • HMM profile vs genomic DNA:
genewise -hmmer your_hmm_profile.HMM your_genomic_sequence.single_fasta
      • peptide vs cDNA/EST
estwise your_peptide_sequence.single_fasta your_cDNA_sequence.single_fasta 
      • HMM profile vs cDNA/EST
estwise -hmmer  your_hmm_profile.HMM your_cDNA_sequence.single_fasta
  • GeneSeqer WWW server + stand alone program. Multiple species models.

standalone cDNA-to-genome alignment program

  • PALMA 2007 program stand alone claimed to outperform other algorithms. Parameter files for human, C.elegans, arabidopsis and drosophila (Nov 2007)

One gene in multiple species

  • SVC

Structured Visualization of Evolutionary Conserved Sequences. Provides colorful graphs depicting intron exon orientation with sizes.

  1. go to SVC
  2. put gene symbol (i.e. SYNGR3) or better Ensembl ID (i.e. ENSG00000127561) into 'Input ID'
  3. check species of interest in 'Search for orthologs in:'
  4. Click 'Submit' next to 'Input ID'

Bacterial genomics

  • SEED precomputed similarities, annotations using colocalization in operons. Very efective for annotation new genes.
  • Integr8 Ensembl-like interface to bacterial genomes
  • WebACT WebACT genome-2-genome comparisons

Promoter prediction

Predicting promoters based on a single sequence is rather unaccurate. Transcription factor (TF) recognition sequences are fairly short, therefore any sequence 1kbp in lenght will contain hundreds of such motifs, only a handful of which will be a true positives. There are two major methods used for finding relevant sites

  • phylogenetic fingerprinting (comparison of promoter regions between orthologues)
  • comparison of promoter regions of genes with similar expression profiles in the same species

We can look for either known TFs recognition sequences (quality of which depends on database used) or perform motif finding de novo, looking simply for any streach of DNA shared between species/co-expressed genes.

  • before you start

make sure that you got a real 5' end exon of the gene (a lot of longer genes in the databases may be missing the true 5'end). You verify quality of your gene of interest by:

    • comparing with orthologues ( Ensembl or Homologene) If close orhologues are longer on the protein level, look for the presumptive missing exon(s).
    • looking at the genomic context (other genes flanking yours)
    • blasting cDNAs 5' end against ESTs database looking for any ESTs extending your gene
    • blasting masked genomic sequence, from 5' end exon to the start/end of the flanking gene

Interpretation of this one may be tricky, on the simplest level look for spliced matches with ORFs in the same orientation as your gene without clear stop codons.

  • find position of repeats

After checking the 5' end of the gene, always look for repetitive sequences using RepeatMasker. It does not mean that say Alu/LINE1 can not be a part of the promoter. Just that you may concentrate on parts which are a bit more specific for the given gene.

phylogenetic fingerprinting

comparisons of promoter regions of orthologous genes in other species

  1. obtain these using Ensembl [42]
  2. use tools described below to find out shared TF binding sites/comon motifs


  • comparing closely related genomic sequences (human and chimp) is not stringent enough
  • comparing wildely different sequences (human and zebrafish) is too stringent for all but small set of genes (i.e. homeobox?)

comparing promoters of coexpressed genes in the same species

  1. use GEO (Gene Expression Omnibus) [43] to find gene(s) with simmilar expression
  2. get promoter sequences using Ensembl [44]


Shared TF binding sites

Quality depends on the underlying database of TF binding sites

Shared DNA motifs only

For more in depth coverage see: Wikiomics:Sequence_motifs

This method will work even for unknown TF binding sites, but it may skip some more degenerate motifs.

  • MEME capable of discovering longer motifs, very sensitive
  • Improbizer takes into account position of the motifs
  • AlignACE Gibbs sampling algorithm
  • MDscan

It is important to use few complementary programs as well as allow predictions of several motifs per program. There is threshold when it comes to number of sequences used, above which there is no improvement in sensitivity.

Comparison of several programs:


Promoter Databases

Transcription Factiors Databases

  • JASPAR (open transcription factors database)
  • TESS(finding transcription factor binding sites within your sequence)

Be aware that simply listing all recognizable TF binding sites in a given stretch of DNA is almost of no value. Only 0.1% of the putative sites listed that way will be functional (Applied bioinformatics for the identification of regulatory elements WW Wasserman, A Sandelin - Nature Reviews Genetics, 2004)

Reviews & tutorials

  • Other tutorial (you need an access to Nature Immunology): [48]
  • PLOS Apr 2006 MacIsaac and Fraenkel article [49]
  • For an extensive table of known programs used for finding motifs see [50]

Transcription start finding

There are various clases of such programs starting with completa ab initio promoter finding using genomic sequence only, through combining it with ab initio gene fining, finally taking into account ESTs data and/or gene annotations. Most of the programs were evaluated on human or mammalian sequences and may not be suitable for other groups of organisms. Masking repeats using RepeatMasker improves performance of McPromoter(strongly) and FirstEF (mildly), whereas has no effect on Eponine and DragonGSF.


  • N-SCAN (standalone only derivative of TwinSCAN) [51]
  • Eponine (mammalian) [52]
  • FirstEF (human) [53]
  • DragonGSF [54]
  • McPromoter (fly, with older version for human) [55]
  • PromH (promoters identification using orthologous genomic sequences) [56]
  • ARTS (superior but not as Jan 2007 publicly available) [57]
  • CoreBoost



  • EPD eukaryotic promoter database
  • DoOP eukaryotic promoter sequences


  • Bajic et al. Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment

Genome Biology August 2006 [58]

  • Bajic VB, Tan SL, Suzuki Y, Sugano S: Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004, [59]

SNPs (Single Nucleotide Polymorphism)

Gene expression


  1. go to Unigene (ESTs derived from the gene)
  2. search Unigene with gene symbol (i.e SYNGR1) or Unigene cluster (i.e. SYNGR1)
  3. look at Expression and distribution of ESTs [60]
  4. click on Expression profile to see electronic Northern


Wide array of tissues and genes.

  • go to Symatlas
  • put your gene symbol, i.e. SYNGR1
  • click on a small icon left to human or mouse genes

in situ hybridisation images


  • Gensat

Atlas of in situ hybridization of mouse brain

  • Genepaint

Atlas of in situ hybridization in mouse embryos. One can order hybridization of a gene of interest.

Digital Differential Display

Searching info about gene homologues in other databases

  • Mouse [61]
  • C.elegans & co. [62]
  • D. melanogaster [63]
  • S. cerevisiae SGD@stanford [64]
  • Proteome (multiple organisms) [65] ($$$) (databases of S. cerevisiae, S.pombe and WormPB [C.elegans+]

Finding protein domains

Finding protein motifs



Disulfide bond predictions

Sequence/structure retrieval

Protein localization and structure prediction

  • collection of links: [66]
  • TarO queries multiple servers


Start with:

  • Evaluating eukaryotic secreted protein prediction [67]
  • Localizome server Localizome combines prediction of Phobius, Pfam, LocaloDom (database of TM topologies and TM helix numbers of around 3,000 eukaryotic Pfam domains)
    • The maximum number of input proteins in a single submission is 500 protein.
    • The length of each sequence is limited to 6,000 residues.

Standard choices

In evaluation (localization):=


  • Recent review: Punta, Marco, Lucy R. Forrest, Henry Bigelow, Andrew Kernytsky, Jinfeng Liu, and Burkhard Rost. “Membrane protein prediction methods.” Methods 41, no. 4 (April 2007): 460-474.
  • Other servers:
    • ConPred2 Consensus TM prediction (metaserver)
    • DAS-TM
    • Phobius (includes signal peptide predictions)
    • PRODIV-TMHMM ref: [68] (results send by email, prediction of TM topology)
    • MemBrain (includes signal peptide predictions) claims to outperform several other top scoring TM predicting programs paper(HTML)

see also links here: [69] and the Signal Peptide section

signal peptides

Accepts both single fasts seqence as well as aligned sequences in multiple fasta format (only the top sequence appears in the output).

  • Phobius author PhD thesis: [70]

Other Programs:

  • SPEPlip Eukaryots/Prokaryots
  • TatP Twin-arginine signal peptide cleavage sites in bacteria Paper: [71]
  • LipoP lipoprotein signal peptides in Gram-negative bacteria. Paper: [72]

protein structure (no-3D)

  • Expasy [73]
  • Coils (coiled coil prediction) [74]
  • Protscale (various prediction based on AA sequence) [75]
  • Jpred (multiple secondary structure predictions run in parallel) [76]

Graphical plots of protein 2D structure

  • TMRPres2D standalone Java program. Top choice:

' 'One-step' data input procedure, including easy querying of the public databases and incorporation of output from various popular prediction tools available from the Web.' allows creating graphs with transmembrane regions, signal peptides, disulfide-bridhes, glicosylated residues etc.

Protein structure databases/comparisons

Metabolic networks

$$ needs more work$$

Protein-protein interactions

Large scale methods such as yeast-two-hybrids (Y2H), immuno-co-precipiatation (IP) produce a large number of false positives. Only about 30-50% of interactions seem to be relevant. While newer improvements of these procedures tend to be more accurate, existing databases struggle to separate good quality data.

Restricted access ($$$ ?):


  • APID Agile Protein Interaction DataAnalyzer


Where to find more information

Also there are relevant pages on Wikiomics web site: