Wikiomics:Bioinfo tutorial
Wikiomics:Bioinfo tutorial (exercises)
Introduction
The goal of this document is to provide a set of illustrative cookbook recipes for wet lab biologists. As there is more than one way of performing a given task, it does point to different tools, but it does not even try to become a yet another directory of links.
You are strongly encouraged to post your comments/ideas on the discussion page of this tutorial.
Before you start
- use text files/plain e-mail whenever possible
- give meaningful names to your files (i.e. humP53_XP12345.fas)
- do not use spaces or Unix special characters in file names (!?"'%&^~*$|/\{}[]()<>:)
- create separate folders/directories for each project with meaningful names
- do not waste paper for printing out your blast search results or sequences
- if primary data needs to be archived burn it on CD using (if possible) non vendor-specific data format
Operating systems/General software
For installing standalone bioinformatics programs the order of preference is: Linux/Unix > MacOS > Windows
Once you move from worksation to bioinformatics application server (almost always Linux/Unix) then as long as you have a Xwindows connection (preferably ssh tunneled) operating system of client workstation plays a minor (if any) role.
DNA sequence processing
Sequence formats
There are multiple sequence formats commonly used. These can be divided into text-only and binary formats.
Text-only
Unaligned sequences
- Raw sequence ( example) The text file contains nothing but sequence. The information that's inside has to be encoded in the name of the file. Some web servers still require it.
- FASTA (description, example) The most commonly used format (by programs/servers). It can contain either single or multiple sequences, each having a name line, also known as the header, starting with ">" in the first column. Keep in mind that various programs/servers introduce extra restrictions how the FastA header should look like. The "safe" standard:
- Alphanumeric characters only [a-z, A-Z, 0-9], avoid spaces!
- Keep the names short [i.e. 30 characters are significant for clustalw, only 8 for older versions of t_coffee]
- In multiple FastA format keep the names unique
Whenever you plan to analyze your sequence using web servers or command line programs, keep it as FASTA. It is easy to cut the raw sequence out of it, and if your file name/sequence names contain accession numbers to find the GenBank entry.
- GenBank format of choice used by GenBank to display extra information about the sequence. It is fairly complex, but easily readable by humans. Keep your sequences as GenBank when you need extra annotations provided by this format.
- GCG GCG package own format. If you need to use GCG, you will have to convert your sequences into it
- XML eXtended Markeup Language. Standard language slowly taking over the Net. XML files are computer-readable (= information is fairly easy to extract from them) and possible to read by humans. Format of choice when complex sequence information need to be described and computer accessible.
- EXP format used by Wikiomics:Staden sequencing package. Contains both sequence and annotations understandable by Wikiomics:gap4.
- GFF format for (mostly) genomic sequence annotations used by Ensembl [1]
Aligned sequences
- MSF Multiple Sequence Format. Used as output for multiple sequence alignments (pileup from GCG) and as input for alignment editors and shadowing programs.
- ALN standard for multiple sequence alignments. Output format of clustalw and t_coffee.
- PHYLIP used as an input in such programs like PHYLIP package, phyml
- NEXUS advanced format allowing more annotation, used by MrBayes
Binary
- ABI binary output format of ABI sequencers. Bulky (= unnecessary large), but so far the only one readable on Mac-OS. Can be easily converted into SCF.
- SCF Sequence Common Format. Standard output for ALF sequencers. More compact than ABI, and commonly used in sequencing projects. Readable on PC and Unix, binary format of the Staden package.
- ZTR successor of SCF with internal data compression. Currently the default binary format for Staden.
Sequence conversion tools
OK for simple format, has problems with aligned sequences
A function in Sean Eddy's SQUID package for converting between unaligned and aligned sequence formats.
can be used for sequence conversion, also for many other bioinformatic related functions.
- for convertions between aligned formats (ALN, MSF, NEXUS, PHYLIP) use:
Processing ABI files and sequence assembly
Before you even start sequencing, plan the way the sequences will be processed. The most important is to establish a proper naming scheme compatible with various programs you (or others) may use downstream for sequence processing. The basic rules for file names are:
- use alphanumeric characters [a-z, A-Z, 0-9]
- use underscore "_" instead of spaces
- full stop shall separate information about sequencing template and primer, like:
prefix_my-template.my-primer_reaction-number
Even small sized projects become cumbersome if file names are not unique (you can not assembly files having the same name!). Once these requirements are satisfied, file name alone will give you enough information about:
- whose sample is it?
- to which sequencing project does it belong?
- what was the sequencing template?
- which primer was used?
- was this sample sequenced before?
Programs for DNA sequence assembly
Fasta files (no traces)
- phrap
phrap your_multiple_fasta_file.fa
Results in: your_multiple_fasta_file.fa.contigs
Used for i.e assembly of ESTs.
ABI/SCF files (traces)
recommended:
- Staden package (Unix, Windows, OSX)
see also another Wikiomics page Wikiomics:Staden
- pregap4 (sequence pre-processing)
Takes a directory or a list of files for preprocessing (creating EXP files, clipping bad quality sequence, marking sequencing vector, repeats etc.
Creating sequence out of ABI chromatogram file is done by a basecalling program. One can go along with ABI-provided basecaller or preferably use:
- phred [2] giving more accurate calls for less accurate part of the sequence (like at the end of the run, say 600bp and more) . Phred also gives a probability/quality values for each of the bases allowing more accurate assembly.
pregap4 %
- gap4 (sequence assembly, editing, verification and much more)
gap4 &
or (to open a specific database at startup):
gap4 database.version.aux &
Other packages:
Vector masking
- EMVEC (web and blast2 based) [4]
Good for a single sequence.
- vectorstrip (Emboss)
vectorstrip -sequence my_fasta_file -vectors my_vectors_fasta_file -mismatch 30 -linkera TAGGATACAG -linkerb TTTTTCAGGACAGACA
- linkera
- 5' sequence
- linkerb
- 3' sequence
- cross_match (from phrap package)
cross_match my_fasta_file my_vectors_fasta_file -minmatch 10 -minscore 20 -screen > screen.out
You will get a file: my_fasta_file.screen --- "devectorised" my_fasta_file
For masking out, let's say, one gene/overlapping clone, etc., type:
cross_match fasta_file_to_be_masked masking_sequence -minmatch 10 -minscore 20 -screen > screen.out
You will get a file: fasta_file_to_be_masked.screen without the sequences matching masking_sequence.
- Figaro
A novel approach for finding putative vector/adapter sequences without knowledge of the vector sequence. "vector sequence is automatically inferred by analyzing the frequency of occurrence of short oligo-nucleotides using Poisson statistics" http://sourceforge.net/apps/mediawiki/amos/index.php?title=Figaro
Masking repetitive sequences using RepeatMasker (WWW/Unix)
- web site: http://www.repeatmasker.org/
- current version (checked on 2010-03.22): 3.2.8
- documentation: http://www.repeatmasker.org/webrepeatmaskerhelp.html
- Online web server [5]
- command line
You have to have a FastA file (it can be multiple FastA). Type:
repmask your_sequence_in_fasta_format
You will get a file: your_sequence_in_fasta_format.masked --- name tells all
species options (choose only one):
-m(us) masks rodent specific and mammalian wide repeats -rod(ent) same as -mus -mam(mal) masks repeats found in non-primate, non-rodent mammals -ar(abidopsis) masks repeats found in Arabidopsis -dr(osophila) masks repeats found in Drosophilas -el(egans) masks repeats found in C. elegans
Primer design
General info about primer design/PCR:
online:
- Primer3plus
- BatchPrimer3 derivative program for designing multiple primers at once
- Pythia new software claiming superiority over primer3
- Oligo (PC & Mac ) $$$, first choice for small projects
- JCVI Primer Designer pipeline primer design for DNA sequencing projects
Also search http://sourceforge.net/ with i.e. "primer pcr" for another, more task specific programs.
- Primer Select from Lasergene (PC & Mac)
DNA -> protein translation
- transeq (Emboss) WWW server
- Protein Translate (ExPasy): WWW server
- ORF Finder WWW server
Cloning in silico
ESTs based
Pre-clustered ESTs:
- STACKdb (limited access, tissue specific splice forms) [6]
- Unigene (no consensus sequence) [7]
- TIGR Gene Indexes
Genome based
- based on homology
- de novo
see also separate page: Wikiomics:Cloning_in_silico
Genome annotation using ESTs assembly
- PASA tool used in many current genome annotation projects
Synthetic genes
Many genes due to an extreme codon bias can not be expressed in model organisms, such as E.coli. To fix this problem one can back-translate protein sequence into nucleotides, using codons preferably used by E.coli. Moreover one can construct sucgh gene from a set of overlapping oligos, introducing restriction sites or other features at will. Programs performing such tasks:
- JCAT http://www.jcat.de/
- GeneDesign
- SGD Syntetic Gene Designer
- Synthetic Gene DataBase [8]
- stand alone program UpGene (SmallTalk) http://www.vectorcore.pitt.edu/upgene/upgene.html
- Codon preference in Pichia pastoris
http://www.microbialcellfactories.com/content/4/1/33
- Ref
Applied Microbiology and Biotechnology Issue: Volume 67, Number 3 Date: May 2005 Pages: 289 - 298 Strategies for efficient production of heterologous proteins in Escherichia coli S. Jana1 and J. K. Deb
Codon usage
There is a bias in codon usage resulting in variation in translation speed/efficiency. The bias itself varies between organisms (i.e. it is strong in yeast, then in E.coli, much weaker in B.subtilis).
- codon usage database [9]
- graphical codon usage analysis http://gcua.schoedl.de/
One can calculate a number of codon bias indexes using programs:
- CodonW web server, Sourceforge project
- E-CAI HTML "The E-CAI server provides a direct threshold value for discerning whether the differences in CAI are statistically significant or whether they are merely artifacts that arise from internal biases in the G+C composition and/or amino acid composition of the query sequences."
RNA
The Vienna RNA package has solutions for many classical RNA problems. It well worth taking some time to familiarize oneself with the algorithms in this package before checking out the alternative methods.
Secondary structure prediction
For Wikiomics:RNA secondary structure prediction from a single sequence the most commonly used tools are MFold and RNAfold. For more single sequence and other comparative methods a large range of methods are available.
For the best accuracy (and evolutionary information) comparative approaches, in particular the alignment folding methods, have proven rather successful. Typically a researcher will align sequences using tools such as ClustalW, MAFFT and ProAlign and then fold the alignment using tools such as RNAalifold or Pfold.
RNA homology search
For the Wikiomics:RNA homology search problem (including secondary structure) a number of algorithms exist. E.g. Infernal and RaveNnA. A number of tools specializing in a specific RNA family are also available eg. tRNA, miRNA etc.
RNA gene-finding
Tools for Wikiomics:RNA gene-finding usually use as input sequence alignmments which are then scanned for potential conserved secondary structure signals. Useful examples of these are QRNA, RNAz and EvoFold.
RNA databases
A number of Wikiomics:RNA databases are available.
- microRNA database (sequences, targets, sumbmission for publications)
- Rfam a large collection of multiple sequence alignments and covariance models covering many common non-coding RNA families.
- BRaliBase Benchmarks of RNA analysis tools on trusted ncRNA datasets.
Protein Alignment
See also article in wikipedia: Sequence alignment software
Two sequences
There are two kinds of alignment: local and global.
- Local
- finds the best region of similarity between two sequences
- water (Emboss) uses the Smith-Waterman algorithm (modified for speed) web interface
- lalign Huang and Miller algorithm [10]
- blast2 [11]
- bestfit (GCG) Smith-Waterman algorithm
- Global
- covers the whole length of both sequences
- needle (Emboss) Needleman-Wunsch global alignment [12]
- gap (GCG) Needleman-Wunsch global alignment
Multiple sequences
For a recent review see Notredame 2007 @PLOS
There is a trade off between accuracy and speed. For speed select MUSCLE. Accuracy of algorithms differs, with the best results on average being achieved using PRALINE and t-coffee/M-coffee. For poteins with multiple/repeated domains you may try graph based algorithms: poa or aba.
- t_coffee (good at producing alignments of distant (~30% similarity) sequences. [13]
t_coffee your_fasta_file.fa
T-Coffee has a limit of 50 sequences. The new version is called M-Coffee, it is a meta-aligner using output of several other algorithms (ClustalW, Poa, Muscle, ProbCons, MAFFT, Dialign-T, PCMA). To run it (bash):
export POA_DIR=/absolute/path/to/Poa/installation/directory export DIALIGNT_DIR=/absolute/path/to/DIALIGN-T/conf t_coffee your_fasta_file.fa -special_mode mcoffee
- MAFFT one of the top alignment programs, with multiple options. See benchmarks from 2006 papers here
- PRALINE (slow but takes into account blast results) [14]
Uses PSI-BLAST for finding most conserved residues in each of the sequences. Very slow but could be more accurate as compared with other methods.
- clustalw (standard algorithm most commonly used and "good enough" in most cases) [15]
clustalw your_fasta_file.fa
output as aln format
- dialign (good for creating alignments with long gaps as with non-orthologues sharing domains only) [16]
Improved, command line version is DIALIGN-T [17]
dialign-t ./path/to/dialign-t/conf/ your_fasta_file.fa your_fasta_file.dialign-t
output is in fasta format
- MUSCLE http://www.ebi.ac.uk/muscle/
muscle -clw -in your_fasta_file.fa -out your_fasta_file.aln -clw
- probcons http://probcons.stanford.edu/
probcons -clustalw -ir 100 -pre 20 -a your_sequence_in_fasta_format.fa > probcons.aln
Graph based algorithm replacing columns with directed acyclic graph. Permits domain recombinations of domains/exons: useful for the alignment of multidomain proteins and ESTs.
poa -read_fasta your_fasta_file.fa -clustal output_file.aln /path/to/similarity/matrix/i.e/blosum80.mat
Method using A-Bruijn graph. Good for alignment of proteins with:
- domains that are not present in all proteins
- domains that are present in different orders in different proteins
- domains that are present in multiple copies in some proteins
DNA: ABA is useful in the alignment of genomic sequences that contain duplications and inversions. Based on: http://www.genome.org/cgi/content/abstract/14/11/2336
- MEME (good for finding as-yet-unknown domains in a set of proteins) MEME web site
meme -nmotifs numberof_motifs_2_find -model oops -protein -sequence your_fasta_file.fa.fa -outfile your_fasta_file.fa.fa.meme
Where 'model' could be oops/zoops/anr.
oops: One Occurrence Per Sequence
zoops: Zero or One Occurrence Per Sequence
anr: Any Number of Repetitions
MEME syntax (see more at http://meme.nbcr.net/meme4_3_0/documentation.html):
meme your_input_fasta_file.fas -nmotifs 20 -mod anr -dna > your_meme_output.htm &
Alignment accuracy assessment
- MUMSA (WWW)
all alignments have to in aligned fasta format and contain the same sequences in the same order.
Paper: http://nar.oxfordjournals.org/cgi/content/full/33/22/7120
Alignment editors & beautifiers
- Jalview (Java) [18]
good for sequence conversion (ALN, MSF, Fasta,Pfam, PIR, BLC)
- clustalx (PC, Unix) [19]
ALN only
- seaview (Linux) [20]
good for sequence conversion (ALN, MSF, Fasta, Mase, Phylip, Nexus)
- genedoc (PC only) genedoc(PC only) very flexible color schemes
- boxshade (no editing, pretty prints only)
ALN, MSF
LaTeX package. accepts ALN and MSF formats.
- WAViS web server
- accepts multi FASTA with Clustal, NEXUS, GCG, Phylip conversion
- EPS SVG output
Sequence logos
MSA or matrix
- enoLOGOS (first choice)
MSA
- WebLogo (input as Fasta, ALN formats)
HMM
(single HMM profile)
(pair-wise HMM profiles)
DNA alignment
coding DNA alignment based on existing protein alignment
This is a method of constructing more accurate DNA alignment for downstream use in phylogenic analyses taking into account coding properties of given DNA sequences.
- PAL2NAL webserver capable of dealing with stop codons7frame shifts, suitable also for pseudogenes
- transAlign Perl script, requires Clustalw article
- RevTrans
- protal2dna
- CodonAlign stand alone program
Genomic DNA against genomic DNA
- MUMmer
- PipMaker aligns two DNA sequences
- MultiPipMaker (web server) article Aligns 2-30 DNA sequences
GATA: a graphic alignment tool for comparative sequence analysis
- AuberGene stand alone Python scripts. Requires blastz for pairs comparison.
- SeqAn::T-Coffee multiple sequence alignment program. SeqAn::T-Coffee aligns amino acid, DNA and RNA sequences using a consistency-based progressive alignment algorithm on a graph of sequence segments.
Non-coding DNA sequences
- Sigma [21]
Designed specially for non-coding DNA sequence Paper: [22]
Similarity plots (dot-plots)
For the explanation how it works look here
- dotter(Unix, PC)
Create dot-plots (DNA-DNA, protein-protein and DNA-protein) between one or more sequences. Ideal for quick estimation of sequence similarity, gene organization (genomic DNA vs cDNA), domain finding for small/medium size sequences. One can also zoom an area of interest. Memory and CPU is the limit but one can compare 200kbp genomic DNA vs 6kbp cDNA in ca 2min on 2GHz Pentium 4.
dotter first_fasta_DNA.seq second_fasta_DNA.seq dotter first_fasta_AA.seq second_fasta_AA.seq dotter first_fasta_DNA.seq second_fasta_AA.seq
Note that the order of DNA vs protein comparison matters. Try to keep sequence names within your multiple_fasta file short.
Scaling down image 10x with -z option:
dotter first_fasta_DNA.seq second_fasta_DNA.seq -z 10
Detailed description how to add annotation to dotplots:
[23]
The example format of sequence features file (passed with -f option to dotter)
100 @1 CDS 1 1124 RED gene1 100 @1 CDS 3360 3965 RED gene2 100 @1 CDS 3360 3968 RED gene3
- dottup info (Emboss, Unix/Web) web form
- Gepard bacterial genome scale fast dotplots (Java)
Database searches (sequence similarity)
There is a tradeoff between finding all possible matches (like with Smith-Waterman) and speed. For most applications Blast (and Psi-Blast) is a first choice. For massive searches, where speed is essential (i.e. mapping all cDNAs to a mammalian genome) use blat
Regular expressions
Searches for short motifs, like [MC]-A-x-E(2)-{IE} (M or C followed by A followed by any followed by EE followed by anything but I or E)
- fuzznuc (from Emboss)
search nucleotide searches
fuzznuc -sequence your_sequence_for_searching.fasta -pattern 'GATTACCA' -mismatch number_of_mismatches_allowed -outfile output_file.txt
- fuzzpro(from Emboss)
search protein sequences
fuzzpro -sequence your_sequence_for_searching.fasta -pattern 'MEGGADE' -mismatch number_of_mismatches_allowed -outfile output_file.txt
- dreg (from Emboss)-
Regular expression search of a nucleotide sequence
dreg -sequence your_sequence_for_searching.fasta -pattern '[CG](5)TG{A}N(1,5)C' -mismatch number_of_mismatches_allowed -outfile output_file.txt
- preg (from Emboss)
Regular expression search of a protein sequence
preg -sequence your_sequence_for_searching.fasta -pattern '[MARE](2)YDY{AGWK}' -mismatch number_of_mismatches_allowed -outfile output_file.txt
For more help about options/switches in Emboss programs type:
program_name -help -verbose
- ScanProsite [24]
[CG](5)TG{A}N(1,5)C
corresponds to:
[C or G] (five times), T, G, {anything except A},1-5 of any, C
More about pattern syntax: [25]
First tier algorithms
- Blast [26]
For the detailed BLAST tutorial see Wikiomics:BLAST_tutorial
Warning: There is a major change of blast standalone binaries. Info below is describing blastall based programs. While most of the options will work with new blast+ , check the information here
Command line: check first if $BLASTDB and $BLASTMAT are set (it should display these names sans $ signs & pathways):
env | grep BLAST
If you get a blank and these are not set in your '~/.ncbirc' file (tcsh):
setenv $BLASTDB /path/to/directory/with/blast/databases/ setenv $BLASTMAT /path/to/your/director/blast/data
or follow instructions for creating '~/.ncbirc' from here: [27]
to create a database out of your fasta file:
formatdb -i your_multiple_protein_file.fa -p T -V formatdb -i your_multiple_DNAseq_file.fa -p F -V
For more options type 'formatdb -h'
- nucleotide search
blastall -p blastn -d ecoli_nn -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastn_out
- protein search
blastall -p blastp -d ecoli_aa -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastp_out
- Blast2 [28]
finding similarity between two sequences/alignment + dotplot on the web f
comparing protein sequence with DNA:
bl2seq -p tblastn -F N -i your_peptide_sequence.fasta -j your_nucleotide_sequence.fasta
- Blat
- description: http://www.genome.org/cgi/content/full/12/4/656
- web server: http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
- command line
blat /path/to/database your_query_in_fasta.fa [-ooc=11.ooc] output.psl
-ooc=11.ooc option speeds up search by using a set of over-occuring 11-mers to create your database out of DNA sequence file:
faToTwoBit input_fasta.fa [in2.fa in3.fa ...] output.2bit
use output.2bit as a database in your blat searches.
Slower but more sensitive
- FastA [29]
- Smith-Waterman/MPsrch [30]
- only Protien queries. Emboss water [31] can be used to search local nucleotide databases .
- Scanps (SW variant) [32]
HMM and profile searches
- PSI-BLAST
- HMM searches
- hmmer [33]
hmmbuild your_ALIGNED_sequences.hmm your_ALIGNED_sequences.fasta hmmcalibrate your_ALIGNED_sequences.hmm hmmsearch your_ALIGNED_sequences.hmm protein_database.fasta
in HMMER3 you have to specify the format, i.e.
/hmmbuild --amino --informat afa --cpu 8 your_ALIGNED_sequences.hmm your_ALIGNED_sequences.fasta
- MEME MAST [ http://meme.sdsc.edu/]
Command line:
mast your_meme_motifs.meme -d your_database.multiple_fasta
Data mining
finding basic info about known gene
Phylogeny
$$$ very preliminary at the moment $$$
- select a set of sequences
- create alignment
- remove bad regions of the alignment/gaps (tree constructing programs deal badly with indels)
- for more than 12 sequences tha space of all possible tree is to large to search one-by-one -> heuristic methods are used
Methods
- maximum parsimony MP
sensitive to the order of sequences.
- distance methods
- Neighbor joining NJ
- UPGMA (Unweighted Pair Group Method with Arithmetic mean) assumes rate of mutation constant along branches of a tree. Less accurate for sequence data than NJ.
- maximum likehood (computationally intense, but more accurate)
To check validity of a tree:
- bootstrap
After creating a tree from MSA bootstrapping program selects number of columns from MSA and calculates a tree. This process is repeated 100-1000 times and resulting trees are compared with previously created one. Branches present in more than 70% of bootstrap samples are believed to have a good support. These estimates are conservative
- multiple samples from the posterior distribution (MrBayes)
Probably overestimate confidence values [38]
Programs
- PHYLYP (phylogeny standard package) see a guide to it: [39]
Individual programs can be scripted. On unix (bash):
- create a new directory
- put your_aligned_proteins.phylip there
- start any text editor, modify content below to suit your options, save as i.e. run_proml_with_my_options.sh
#!/usr/bin/sh proml << EOF your_aligned_proteins.phylip 1 S y EOF
- run it
source run_proml_with_my_options.sh
- after completein the run it reports:
Output written to file "outfile" Tree also written onto file "outtree"
- PHYML http://atgc.lirmm.fr/phyml/ (ML)
- MrBayes http://mrbayes.csit.fsu.edu/ (Bayes / Markov chain Monte Carlo.)
requires NEXUS format. To tun it on command line: 'mb'
- RAxML-VI http://www.ics.forth.gr/~stamatak/index-Dateien/Page443.htm (ML for large phylogenetic trees)
- TREE-PUZZLE http://www.tree-puzzle.de/ (ML)
- Mesquite collection of modules for performing a wide range of analyses from various branches of evolutionary biology (e.g., phylogenetics, molecular evolution, population genetics, geometric morphometrics)
Molecular clock
- BEAST (Linux, Win, Mac) Bayesian MCMC analysis of molecular sequences
Tree display
For the comprehensive list see: http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software
- PhyloWidget novel (2008) program (standalone + web java widget)
References
- extensive list of programs: http://evolution.genetics.washington.edu/phylip/software.html
- tutorial articles:
Genomic structure
One gene in one species
Almost 20 species have as for today (May 2006) separate entries in Ensembl. Possible searches include: blast, GeneBank accession number, gene symbol, chromosomal band, marker etc. To elucidate genomic organization (exons/introns) of a gene starting from a known gene name/accession number.
- Ensembl (known transcript)
- go to Ensembl
- click on a link to an organism of interest say Human (Homo sapiens)
- on the top of the page, "Search for:" (select "Gene" from pull down menu)
- put the gene symbol (i.e. p53), Search
- select from the list of hits Ensembl gene (click on the red link)
- you are in "Ensembl Gene Report"
- right to "Sequence Markup" click on " View genomic sequence for this gene with exons highlighted"
- the sequence (including SNPs) is displayed
- Spidey (novel transcript)
- get cDNA sequence in FASTA format
- check it for repetitive sequences using RepeatMasker
- get genomic sequence containing the gene with 1-10kb flanks
- check the gene structure using dotter this is to detect cDNA/genomic DNA miss-assembly/gaps etc.
- go to http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/spideyweb.cgi
- fill in the form selecting the proper splice site matrix (so far: vertebrates, Drosophila, C. elegans, or plants)
- you get a table with genomic coordinates/mRNA coordinates, exon length, % of identity, number ofmismatches, gaps, precence or abcence of splice donor and acceptor sites
Very good at finding exons in the same species (i.e human cDNA/human genomic sequence), more problematic with cross-species comparisons (i.e. human cDNA, bovine genomic sequence -> most of the exons OK, with few bad predictions).
comand line only. Slow but good at finding small exons. So far has matrices for human (default), mouse, rice and C.elegans.
exalin your_genomic_sequence.fa your_cDNA_sequence.fa
- wise2 [40]
Best for: comparing a protein sequence or hmm profile to a genomic/cDNA sequence from other organism Restrictions: fasta headers longer than 15 characters are truncated
One needs to set up env variable for wise, i.e bash:
export WISECONFIGDIR=/usr/biosoft/packages/wise/wise2.2.3-rc7//wisecfg/
- peptide vs genomic DNA:
genewise your_peptide_sequence.single_fasta your_genomic_sequence.single_fasta
- HMM profile vs genomic DNA:
genewise -hmmer your_hmm_profile.HMM your_genomic_sequence.single_fasta
- peptide vs cDNA/EST
estwise your_peptide_sequence.single_fasta your_cDNA_sequence.single_fasta
- HMM profile vs cDNA/EST
estwise -hmmer your_hmm_profile.HMM your_cDNA_sequence.single_fasta
- GeneSeqer WWW server + stand alone program. Multiple species models.
- Pairagon [41]
standalone cDNA-to-genome alignment program
- PALMA 2007 program stand alone claimed to outperform other algorithms. Parameter files for human, C.elegans, arabidopsis and drosophila (Nov 2007)
One gene in multiple species
- SVC
Structured Visualization of Evolutionary Conserved Sequences. Provides colorful graphs depicting intron exon orientation with sizes.
- go to SVC
- put gene symbol (i.e. SYNGR3) or better Ensembl ID (i.e. ENSG00000127561) into 'Input ID'
- check species of interest in 'Search for orthologs in:'
- Click 'Submit' next to 'Input ID'
Bacterial genomics
- SEED precomputed similarities, annotations using colocalization in operons. Very efective for annotation new genes.
- Integr8 Ensembl-like interface to bacterial genomes
- WebACT WebACT genome-2-genome comparisons
Promoter prediction
Predicting promoters based on a single sequence is rather unaccurate. Transcription factor (TF) recognition sequences are fairly short, therefore any sequence 1kbp in lenght will contain hundreds of such motifs, only a handful of which will be a true positives. There are two major methods used for finding relevant sites
- phylogenetic fingerprinting (comparison of promoter regions between orthologues)
- comparison of promoter regions of genes with similar expression profiles in the same species
We can look for either known TFs recognition sequences (quality of which depends on database used) or perform motif finding de novo, looking simply for any streach of DNA shared between species/co-expressed genes.
- before you start
make sure that you got a real 5' end exon of the gene (a lot of longer genes in the databases may be missing the true 5'end). You verify quality of your gene of interest by:
- comparing with orthologues ( Ensembl or Homologene) If close orhologues are longer on the protein level, look for the presumptive missing exon(s).
- looking at the genomic context (other genes flanking yours)
- blasting cDNAs 5' end against ESTs database looking for any ESTs extending your gene
- blasting masked genomic sequence, from 5' end exon to the start/end of the flanking gene
Interpretation of this one may be tricky, on the simplest level look for spliced matches with ORFs in the same orientation as your gene without clear stop codons.
- find position of repeats
After checking the 5' end of the gene, always look for repetitive sequences using RepeatMasker. It does not mean that say Alu/LINE1 can not be a part of the promoter. Just that you may concentrate on parts which are a bit more specific for the given gene.
phylogenetic fingerprinting
comparisons of promoter regions of orthologous genes in other species
- obtain these using Ensembl [42]
- use tools described below to find out shared TF binding sites/comon motifs
Caveats:
- comparing closely related genomic sequences (human and chimp) is not stringent enough
- comparing wildely different sequences (human and zebrafish) is too stringent for all but small set of genes (i.e. homeobox?)
comparing promoters of coexpressed genes in the same species
- use GEO (Gene Expression Omnibus) [43] to find gene(s) with simmilar expression
- get promoter sequences using Ensembl [44]
Tools:
Quality depends on the underlying database of TF binding sites
- Consite produces a set of transcription factors binding sites shared between two sequences
- zpicture plus rVISTA
- Conreal (multiple alignment algorithms) [45]
- Confac [46]
- Mulan multiple sequences
- Footprinter [47]
For more in depth coverage see: Wikiomics:Sequence_motifs
This method will work even for unknown TF binding sites, but it may skip some more degenerate motifs.
- MEME capable of discovering longer motifs, very sensitive
- Improbizer takes into account position of the motifs
- AlignACE Gibbs sampling algorithm
- MDscan
It is important to use few complementary programs as well as allow predictions of several motifs per program. There is threshold when it comes to number of sequences used, above which there is no improvement in sensitivity.
Comparison of several programs: http://nar.oxfordjournals.org/cgi/content/full/33/15/4899#TBL1
Others:
Promoter Databases
- Eukaryotic Promoter Database
- Mammalian Promoter Database MPromDb
- DoOP eukaryotic promoter sequences
- DBTSS Database of Transcriptional Start Sites
Transcription Factiors Databases
- JASPAR (open transcription factors database)
- TESS(finding transcription factor binding sites within your sequence)
Be aware that simply listing all recognizable TF binding sites in a given stretch of DNA is almost of no value. Only 0.1% of the putative sites listed that way will be functional (Applied bioinformatics for the identification of regulatory elements WW Wasserman, A Sandelin - Nature Reviews Genetics, 2004)
Reviews & tutorials
- gene regulation tools site very comprehensive
- Other tutorial (you need an access to Nature Immunology): [48]
- PLOS Apr 2006 MacIsaac and Fraenkel article [49]
- For an extensive table of known programs used for finding motifs see [50]
Transcription start finding
There are various clases of such programs starting with completa ab initio promoter finding using genomic sequence only, through combining it with ab initio gene fining, finally taking into account ESTs data and/or gene annotations. Most of the programs were evaluated on human or mammalian sequences and may not be suitable for other groups of organisms. Masking repeats using RepeatMasker improves performance of McPromoter(strongly) and FirstEF (mildly), whereas has no effect on Eponine and DragonGSF.
Programs
- N-SCAN (standalone only derivative of TwinSCAN) [51]
- Eponine (mammalian) [52]
- FirstEF (human) [53]
- DragonGSF [54]
- McPromoter (fly, with older version for human) [55]
- PromH (promoters identification using orthologous genomic sequences) [56]
Newest:
Databases
- VISTA Enhancer Browser Human/vertebrates enhancers DB
Reviews
- Bajic et al. Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment
Genome Biology August 2006 [58]
- Bajic VB, Tan SL, Suzuki Y, Sugano S: Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004, [59]
SNPs (Single Nucleotide Polymorphism)
- NCBI SNP database http://www.ncbi.nlm.nih.gov/SNP/ (strongly recommended)
- HGVbase Phenotype/Genotype database (funding expired?)
- JSNP http://snp.ims.u-tokyo.ac.jp/ (some access problems)
- CSHL http://snp.cshl.org/
Gene expression
Unigene
- go to Unigene (ESTs derived from the gene)
- search Unigene with gene symbol (i.e SYNGR1) or Unigene cluster (i.e. SYNGR1)
- look at Expression and distribution of ESTs [60]
- click on Expression profile to see electronic Northern
Symatlas
Wide array of tissues and genes.
- go to Symatlas
- put your gene symbol, i.e. SYNGR1
- click on a small icon left to human or mouse genes
in situ hybridisation images
- EMAGE
http://genex.hgu.mrc.ac.uk/Emage/database/emageIntro.html
- Gensat
Atlas of in situ hybridization of mouse brain
- Genepaint
Atlas of in situ hybridization in mouse embryos. One can order hybridization of a gene of interest.
Digital Differential Display
http://www.ncbi.nlm.nih.gov/UniGene/info_ddd.shtml
Searching info about gene homologues in other databases
- Mouse [61]
- C.elegans & co. [62]
- D. melanogaster [63]
- S. cerevisiae SGD@stanford [64]
- Proteome (multiple organisms) [65] ($$$) (databases of S. cerevisiae, S.pombe and WormPB [C.elegans+]
Finding protein domains
Finding protein motifs
Phosphorylation
GPI-Anchors
Disulfide bond predictions
- DISULFIND Newest
Sequence/structure retrieval
- Entrez
- Batch Entrez (retrieving multiple DNA/protein entries at once):
- SRS (complex queries )
- PROMPT Protein Mapping and Comparison Tool
Protein localization and structure prediction
- collection of links: [66]
- TarO queries multiple servers
localization
Start with:
- Evaluating eukaryotic secreted protein prediction [67]
- Localizome server Localizome combines prediction of Phobius, Pfam, LocaloDom (database of TM topologies and TM helix numbers of around 3,000 eukaryotic Pfam domains)
- The maximum number of input proteins in a single submission is 500 protein.
- The length of each sequence is limited to 6,000 residues.
Standard choices
In evaluation (localization):=
- MultiLoc
- LOCSVMPSI (localization, results by email) site down 2010-03-11
- BaCelLo (max five fasta sequences)
- ProteinProwler (plant/nonplant and four localizations only)
transmembrane
- Recent review: Punta, Marco, Lucy R. Forrest, Henry Bigelow, Andrew Kernytsky, Jinfeng Liu, and Burkhard Rost. “Membrane protein prediction methods.” Methods 41, no. 4 (April 2007): 460-474.
- top servers were evaluated by Cuthbertson et al.( Protein Engineering Design and Selection 2005) full text:
- Other servers:
- ConPred2 Consensus TM prediction (metaserver)
- DAS-TM
- Phobius (includes signal peptide predictions)
- PRODIV-TMHMM ref: [68] (results send by email, prediction of TM topology)
- MemBrain (includes signal peptide predictions) claims to outperform several other top scoring TM predicting programs paper(HTML)
see also links here: [69] and the Signal Peptide section
signal peptides
- SignalP
- Phobius (predicts signal peptide, TM and orientation)
- PolyPhobius Homology supported signal peptide predictions
Accepts both single fasts seqence as well as aligned sequences in multiple fasta format (only the top sequence appears in the output).
- Phobius author PhD thesis: [70]
Other Programs:
- SPEPlip Eukaryots/Prokaryots
- TatP Twin-arginine signal peptide cleavage sites in bacteria Paper: [71]
- LipoP lipoprotein signal peptides in Gram-negative bacteria. Paper: [72]
protein structure (no-3D)
- Expasy [73]
- Coils (coiled coil prediction) [74]
- Protscale (various prediction based on AA sequence) [75]
- Jpred (multiple secondary structure predictions run in parallel) [76]
Graphical plots of protein 2D structure
- TMRPres2D standalone Java program. Top choice:
' 'One-step' data input procedure, including easy querying of the public databases and incorporation of output from various popular prediction tools available from the Web.' allows creating graphs with transmembrane regions, signal peptides, disulfide-bridhes, glicosylated residues etc.
- TOPO2 web based
Protein structure databases/comparisons
Metabolic networks
$$ needs more work$$
- Reactome
- KEGG
- BioCyc
- Cyclone - provides an open source Java API for easier access to BioCyc.
- RegulonDB E.coli K12 DB (operons/genes/regulatory elements)
Protein-protein interactions
Large scale methods such as yeast-two-hybrids (Y2H), immuno-co-precipiatation (IP) produce a large number of false positives. Only about 30-50% of interactions seem to be relevant. While newer improvements of these procedures tend to be more accurate, existing databases struggle to separate good quality data.
- IntAct@EBI literature and user submission data.
- DIP
- STRING
- BindingDB
- BioGRID
- Mammalian Protein-Protein Interaction Database @MIPS
- STITCH Chemical-Protein Interactions
Restricted access ($$$ ?):
- BIND / BONDplus (registration required)
Tools
- APID Agile Protein Interaction DataAnalyzer
Papers
Where to find more information
- Bioinformatics Links Directory @UBiC
- Approaches to Web Development for Bioinformatics
- Bioinformatics Resources - Medical Computing .Net
Also there are relevant pages on Wikiomics web site:
Credits
- Darek Kedra wrote this tutorial