Wikiomics:BLAST tutorial

Blast most popular DNA/protein sequence search algorithm tool. See also BLAST.

This tutorial covers previous version of blast (blastall). The newest BLAST+ tools use NCBI C++ Toolkit and offer new functionality not covered here. For detailed help on command line blastall switches type: blastall --help Download blast searches results from

=BLAST input= >MIMI_L4 complement(6238..7602) MPQKTSKSKSSRTRYIEDSDDETRGRSRNRSIEKSRSRSLDKSQKKSRDK SLTRSRSKSPEKSKSRSKSLTRSRSKSPKKCITGNRKNSKHTKKDNEYTT EESDEESDDESDGETNEESDEELDNKSDGESDEEISEESDEEISEESDED VPEEEYDDNDIRNIIIENINNEFARGKFGDFNVIIMKDNGFINATKLCKN
 * sequence as FASTA

proteins: AAA68881.1 129295 gi|129295
 * accession number (web only)
 * gi number

More about input formats is @NCBI

=Databases=

Nucleotide
and many more..
 * nr - All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (no EST, STS, GSS)
 * month - All new GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days
 * est - Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
 * est_human - human EST sequences
 * est_mouse - mouse EST sequences

Protein
+++ =BLAST algorithms=
 * nr - All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
 * month - GenBank CDS translation+PDB+SP+PIR+PRF released in the last 30 days.
 * swissprot - the last major release of the SWISS-PROT protein sequence database

blastn
>DNA_seq1 CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
 * nucleotide query vs nucleotide database
 * select Somewhat similar sequences (blastn)
 * use nr/nt database

blastp
>MIMI_L4 complement(6238..7602) MPQKTSKSKSSRTRYIEDSDDETRGRSRNRSIEKSRSRSLDKSQKKSRDK SLTRSRSKSPEKSKSRSKSLTRSRSKSPKKCITGNRKNSKHTKKDNEYTT
 * protein query vs protein DB
 * search nr protein database

blastx
>human_genomic_seq TGGACTCTGCTTCCCAGACAGTACCCCTGACAGTGACAGAACTGCCACTCTCCCCACCTG ACCCTGTTAGGAAGGTACAACCTATGAAGAAAAAGCCAGAATACAGGGGACATGTGAGCC ACAGACAACACAAGTGTGCACAACACCTCTGAGCTGAGCTTTTCTTGATTCAAGGGCTAG TGAGAACGCCCCGCCAGAGATTTACCTCTGGTCTTCTGAGGTTGAGGGCTCGTTCTCTCT TCCTGAATGTAAAGGTCAAGATGCTGGGCCTCAGTTTCCTCTTACATACTCACCAAAAGG CTCTCCTGATCAGAGAAGCAGGATGCTGCACTTGTCCTCCTGTCGATGCTCTTGGCTATG ACAAAATCTGAGCTTACCTTCTCTTGCCCACCTCTAAACCCCATAAGGGCTTCGTTCTGT GTCTCTTGAGAATGTCCCTATCTCCAACTCTGTCATACGGGGGAGAGCGAGTGGGAAGGA TCCAGGGCAGGGCTCAGACCCCGGCGCATGGACCTAGTCGGGGGCGCTGGCTCAGCCCCG CCCCGCGCGCCCCCGTCGCAGCCGACGCGCGCTCCCGGGAGGCGGCGGCAGAGGCAGCAT CCACAGCATCAGCAGCCTCAGCTTCATCCCCGGGCGGTCTCCGGCGGGGAAGGCCGGTGG GACAAACGGACAGAAGGCAAAGTGCCCGCAATGGAGGGAGCATCCTTTGGCGCGGGCCGT GCGGGAGCTGCCTTTGATCCCGTGAGCTTTGCGCGGCGGCCCCAGACCCTGTTGCGGGTC GTGTCCTGG
 * translated in 6 frames nucleotide query vs protein DB
 * search human proteins database:
 * BLAST@NCBI ->
 * Human ->
 * select:
 * Database: RefSeq protein
 * Program: BLASTX

tblastn

 * protein query vs translated in 6 frames nucleotide database

>Q9BDJ6|GHRL_BOVIN Appetite-reg.hormone precursor MPAPWTICSLLLLSVLCMDLAMAGSSFLSPEHQKLQRKEAKKPSGRLKPRTLEGQFDPEV GSQAEGAEDELEIRFNAPFNIGIKLAGAQSLQHGQTLGKFLQDILWEEAEETLANE
 * search non-human, non-mouse ESTs

tblastx
>gi|50539273|gb|CO636075.1 Gregarina niphandrodes GCCATTACGGCCGGGGGAAAGAAGTACTACCAACACAATGGTGGCTGATGACAAGATAATCACCACGAAG GTTGTTTCTAAGAAGGTGGTTTCCACCAAGGTGGTCACCGGCGGCGACAAGAAGAAGAAGCATGTTGTTG CCGAATAATTGATCTACATTATTAACTTGTTCATCAACATCCTCAACGACCTCATCATCATCTCCTACAC GATGATCATCGTACACAACTTCTTACTCGACAACTTCTCCTCAATCTTCTATGATGCCACCAACCTCTGC TGCCGCACCACCACTAATACAATTGTTGCTCGTCAAATCCTCAACGCCGTCCGTCTTATCATCCTCGCCG AACTGTTGACCCACGCTACCACCAACGGGGCCACCGCCGCCACCCACTACTACAGACCCGCCTACCAACG TTGTACATACATATTCTACATCATCAACGACTGATGCCACCGAGCTTCTTTCCACGCCTGATCTACTTGC CCACCCAACAACCCCCCTAAAACAAAATCTGTGTCCCCCGTCTCGGCCCGCGTATCCAATATTTCTAGAG CCGCCCCCACCGGTGCGGATCCCCCTCTTGTGCCCCCCTTCACGGGGGTTTATTCCCGCCCGTGGGTGTA CCACAGTGCCCCACACCTGGTCCCCTGGTGCACAAACGTGCTTACCCCCTCTACAC
 * translated in 6 frames nucleotide query vs translated in 6 frames nucleotide database
 * nowadays seldom used except ESTs/genomic sequences of more exotic genomes


 * try blastx first (no sensible hits) then
 * tblastx vith non-human, non-mouse ESTs (some hits to other ESTs)

megaBLAST

 * very fast search for highly similar DNA sequences (95% or more )
 * good for:
 * cDNA vs genome (same or very close species)
 * splice variants of the same gene

PSI-BLAST Position-Specific Iterative BLAST

 * starts with a regular blastp search (protein query vs protein database)
 * uses best hits above a threshold (E -10^5) to create alignment and compute protein profile
 * uses protein profile to perform next iterations of search, adding new best hits to the profile
 * very good at picking distant relationships between proteins

>X.trop_syngr1_NP_001016195.1 MEGGAYGAGKAGGAFDPQTFVRQPHTVLRMVSWVFSIVVFGCIINEGYINASTEAEEHCIFNRNSSACAY GVTVGVLAFLTCLLYLAVDVYFPQISSVKDRKKTVISDIAVSGLWAFFWFVGFCFLANQWQVSNPNDNPM NEGADAARAAIAFSFFSIFTWAGQAVLAYQRYRLGSDSALFSQDYMDPSQDQGPPYPPYASNEDLDPSAG YQQPPSDAYDAGSQGYQTQDY


 * run search with regular blastp, observe hits
 * repeat serch with PSI-BLAST, run 3 iterations, note novel hits

=Command line BLAST= check first if $BLASTDB and $BLASTMAT are set (it should display these names sans $ signs & pathways): env | grep BLAST

If you get a blank and these are not set in your '~/.ncbirc' file (tcsh): setenv $BLASTDB /path/to/directory/with/blast/databases/ setenv $BLASTMAT /path/to/your/director/blast/data or follow instructions for creating '~/.ncbirc' from here:

to create a database out of your fasta file: formatdb -i your_multiple_protein_file.fa -p T -V formatdb -i your_multiple_DNAseq_file.fa -p F -V For more options type 'formatdb -h'

blastall -p blastn -d ecoli_nn -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastn_out blastall -p blastp -d ecoli_aa -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastp_out
 * nucleotide search
 * protein search

=Non-command line BLAST= BLAST can be setup to be queried using an internet browser instead of a command line, using NCBI's wwwblast software. The command line must be used during the setup of wwwblast, but not when running the actual BLAST queries.

=Advanced topics=

Substitution Matrices
For calculating scores of protein query vs putative protein hit BLAST uses so called substitution matrices. These are calculated from multiple protein alignments by counting amino acid frequences in a given columns. You can find more info @NCBI.

By default blastp uses BLOSUM62 ( a matrix constructed from local alignment of proteins with similarity 62% and better). One can use BLOSUM45 for more divergent sequences and BLOSUM80 for closer related ones.

For very short protein queries (15 aa and less) use PAM30 [1].

See also:

1. Jonathan Pevsner, Bioinformatics and Functional Genomics (John Wiley and Sons, 2009), p110.

Word size
BLAST uses 'chunks' of a given length as an initial matches between query sequence and putative match in database. There is a trade off (sensitivity vs speed):
 * shorter the word size, more sensitive the search.
 * longer word sizes, faster the search

Typical word sizes:
 * blastp: 3 (default), 2
 * blastn: 28 (default), up to to 64

More @NCBI

Filters and Masking
Input query sequence may contain:
 * repetitive sequences
 * simple i.e CACACA, proline-rich etc. (filtered by default by DUST/SEG)
 * complex i.e. Alu/LINE1 sequences (filtered by using "Species-specific repeats" and picking the correct organism, if available)
 * CAVEAT: transmembrane and coiled-coil regions are not masked by NCBI BLAST tools and therefore will give a spurious, non-homologous hits. You have to detect them and mask them yourself.

By providing query sequence i.e. genomic sequence with EXONS as capitals and introns in small letters we can search exonic sequences only.
 * custom filtering (lower case masking)

Filtering blast output using Entrez query
One can be interested only in blast hits from i.e. specific organism. On a command line: blastall -u "[Arabidopsis thaliana[organism]" -p blastp -d nr -i my_protein_input.fas -o my_protein_input_vs_nr_Athaliana.blastout.txt

This can also be written as: blastall -u "txid3702[ORGN]" -p blastp -d nr -i my_protein_input.fas -o my_protein_input_vs_nr_Athaliana.blastout.txt

=BLAST implementations= Read:
 * Cha, I.E., and E.C. Rouchka. “Comparison of Current BLAST Software on Nucleotide Sequences.” In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 8 pp., 2005.

Most popular are WU-BLAST and a distant second MPI-BLAST. Things to watch: ScalaBLAST, Grid-based implementation on one side and FPGA/CUDA hardware.

WU-BLAST / AB-BLAST
Often faster implementation of BLAST algorithm from Washington University.
 * http://blast.wustl.edu/

Program was licensed to Advanced Biocomputing and becoming AB-BLAST. Previous licenses are still valid, but WU-BLAST versions except an obsolete 2.0a19 from 1998 are unavailable. Latest AB-BLAST version is 2009-10-30. Licensing is here (1k$ /year/lab, free for personal use on your own computer).

MPI-BLAST
BLAST implementation suitable for clusters, under active development.
 * http://www.mpiblast.org/

Latest version: mpiBLAST-1.6.0 from 2010-07-13.

ScalaBLAST
Novel implementation for clusters, requires MPI.
 * http://hpc.pnl.gov/projects/scalablast/

Latest version: 1.0.640 from 2010-04-05

CS-BLAST
New protein search program claimed to be more (twice) as sensitive as blastp.


 * http://toolkit.lmb.uni-muenchen.de/cs_blast

Biegert, A. and Soding, J. (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci USA, 106 (10), 3770-3775

Latest version: 2.1.0 from 2009-08-17

FSA-BLAST
Fast but not tested/popular implementation by Michael Cameron (ver. 1.05 Released: 8 March 2006)
 * http://fsa-blast.sourceforge.net/

=Hardware speedups= There are several BLAST implementations running on FPGA processors. There are also projects porting BLAST to fast but not mainstream hardware, starting from Cell to GPUs.

DeCypher
from TimeLogic http://www.timelogic.com/decypher_intro.html

BioBoost
from Progeniq http://www.progeniq.com/

Other efforts

 * Mercury BLASTP
 * CAAD BLASTP

=External tutorials/courses=
 * Tutorial_1 @NCBI
 * PSI-BLAST tutorial @NCBI
 * Altschul @NCBI The Statistics of Sequence Similarity Scores
 * BLAST 'Rules of Thumb' etc @NCBI
 * lecture @Knowledge Systems Institute good graphics
 * tutorial @Geospiza