Revision as of 16:32, 7 February 2008

BLAST is the most popular DNA/protein sequence search tool. See also Wikipedia:BLAST.

Download the BLAST search results from here: http://openwetware.org/wiki/Image:Blast_course_080208.zip

BLAST input

sequence as FASTA

>MIMI_L4 complement(6238..7602)
MPQKTSKSKSSRTRYIEDSDDETRGRSRNRSIEKSRSRSLDKSQKKSRDK
SLTRSRSKSPEKSKSRSKSLTRSRSKSPKKCITGNRKNSKHTKKDNEYTT
EESDEESDDESDGETNEESDEELDNKSDGESDEEISEESDEEISEESDED
VPEEEYDDNDIRNIIIENINNEFARGKFGDFNVIIMKDNGFINATKLCKN
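
As a minimal sketch of how FASTA input like the above is structured (a header line starting with '>' followed by one or more sequence lines), the following Python function splits FASTA text into records. The function name read_fasta is hypothetical and chosen for this illustration only; it is not part of any BLAST distribution.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>MIMI_L4 complement(6238..7602)
MPQKTSKSKSSRTRYIEDSDDETRGRSRNRSIEKSRSRSLDKSQKKSRDK
SLTRSRSKSPEKSKSRSKSLTRSRSKSPKKCITGNRKNSKHTKKDNEYTT
"""
records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```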

accession number (web only)

proteins:

AAA68881.1

gi number

129295
gi|129295

More about input formats at NCBI

Databases

Nucleotide

nr - All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (no EST, STS, GSS)
month - All new GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days
est - Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
est_human - human EST sequences
est_mouse - mouse EST sequences

and many more.

Protein

nr - All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
month - GenBank CDS translation+PDB+SP+PIR+PRF released in the last 30 days.
swissprot - the last major release of the SWISS-PROT protein sequence database

+++

BLAST algorithms

regular BLAST

blastn

nucleotide query vs nucleotide database

>DNA_seq1
CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG
GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG
AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA

select Somewhat similar sequences (blastn)
use nr/nt database

blastp

protein query vs protein DB

>MIMI_L4 complement(6238..7602)
MPQKTSKSKSSRTRYIEDSDDETRGRSRNRSIEKSRSRSLDKSQKKSRDK
SLTRSRSKSPEKSKSRSKSLTRSRSKSPKKCITGNRKNSKHTKKDNEYTT

search nr protein database

blastx

nucleotide query translated in 6 frames vs protein DB

>human_genomic_seq
TGGACTCTGCTTCCCAGACAGTACCCCTGACAGTGACAGAACTGCCACTCTCCCCACCTG
ACCCTGTTAGGAAGGTACAACCTATGAAGAAAAAGCCAGAATACAGGGGACATGTGAGCC
ACAGACAACACAAGTGTGCACAACACCTCTGAGCTGAGCTTTTCTTGATTCAAGGGCTAG
TGAGAACGCCCCGCCAGAGATTTACCTCTGGTCTTCTGAGGTTGAGGGCTCGTTCTCTCT
TCCTGAATGTAAAGGTCAAGATGCTGGGCCTCAGTTTCCTCTTACATACTCACCAAAAGG
CTCTCCTGATCAGAGAAGCAGGATGCTGCACTTGTCCTCCTGTCGATGCTCTTGGCTATG
ACAAAATCTGAGCTTACCTTCTCTTGCCCACCTCTAAACCCCATAAGGGCTTCGTTCTGT
GTCTCTTGAGAATGTCCCTATCTCCAACTCTGTCATACGGGGGAGAGCGAGTGGGAAGGA
TCCAGGGCAGGGCTCAGACCCCGGCGCATGGACCTAGTCGGGGGCGCTGGCTCAGCCCCG
CCCCGCGCGCCCCCGTCGCAGCCGACGCGCGCTCCCGGGAGGCGGCGGCAGAGGCAGCAT
CCACAGCATCAGCAGCCTCAGCTTCATCCCCGGGCGGTCTCCGGCGGGGAAGGCCGGTGG
GACAAACGGACAGAAGGCAAAGTGCCCGCAATGGAGGGAGCATCCTTTGGCGCGGGCCGT
GCGGGAGCTGCCTTTGATCCCGTGAGCTTTGCGCGGCGGCCCCAGACCCTGTTGCGGGTC
GTGTCCTGG

search human proteins database:
- BLAST@NCBI ->
- Human ->
- select:
  - Database: RefSeq protein
  - Program: BLASTX
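
The six-frame translation that blastx applies to a nucleotide query can be sketched roughly as follows. This is an illustration only, assuming the standard genetic code and writing unknown codons as 'X'. The function names are hypothetical and this is not how NCBI BLAST is implemented internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA codon by codon; '*' marks stops, 'X' unknown codons."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Start of the human_genomic_seq query used above.
for frame, protein in six_frame_translation("TGGACTCTGCTTCCCAGACAGTACCCCTGA").items():
    print(frame, protein)
```

Each of the six translations is then compared with the protein database, which is why blastx can find coding regions without knowing the strand or reading frame in advance.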

tblastn

protein query vs nucleotide database translated in 6 frames

>Q9BDJ6|GHRL_BOVIN Appetite-reg.hormone precursor
MPAPWTICSLLLLSVLCMDLAMAGSSFLSPEHQKLQRKEAKKPSGRLKPRTLEGQFDPEV
GSQAEGAEDELEIRFNAPFNIGIKLAGAQSLQHGQTLGKFLQDILWEEAEETLANE

search non-human, non-mouse ESTs

tblastx

nucleotide query translated in 6 frames vs nucleotide database translated in 6 frames
seldom used nowadays, except for ESTs/genomic sequences of more exotic genomes

>gi|50539273|gb|CO636075.1 Gregarina niphandrodes 
GCCATTACGGCCGGGGGAAAGAAGTACTACCAACACAATGGTGGCTGATGACAAGATAATCACCACGAAG
GTTGTTTCTAAGAAGGTGGTTTCCACCAAGGTGGTCACCGGCGGCGACAAGAAGAAGAAGCATGTTGTTG
CCGAATAATTGATCTACATTATTAACTTGTTCATCAACATCCTCAACGACCTCATCATCATCTCCTACAC
GATGATCATCGTACACAACTTCTTACTCGACAACTTCTCCTCAATCTTCTATGATGCCACCAACCTCTGC
TGCCGCACCACCACTAATACAATTGTTGCTCGTCAAATCCTCAACGCCGTCCGTCTTATCATCCTCGCCG
AACTGTTGACCCACGCTACCACCAACGGGGCCACCGCCGCCACCCACTACTACAGACCCGCCTACCAACG
TTGTACATACATATTCTACATCATCAACGACTGATGCCACCGAGCTTCTTTCCACGCCTGATCTACTTGC
CCACCCAACAACCCCCCTAAAACAAAATCTGTGTCCCCCGTCTCGGCCCGCGTATCCAATATTTCTAGAG
CCGCCCCCACCGGTGCGGATCCCCCTCTTGTGCCCCCCTTCACGGGGGTTTATTCCCGCCCGTGGGTGTA
CCACAGTGCCCCACACCTGGTCCCCTGGTGCACAAACGTGCTTACCCCCTCTACAC

try blastx first (no sensible hits), then
tblastx with non-human, non-mouse ESTs (some hits to other ESTs)

megaBLAST

very fast search for highly similar DNA sequences (95% identity or more)
good for:
- cDNA vs genome (same or very close species)
- splice variants of the same gene
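
To make the 95% figure concrete, here is a small sketch of how percent identity can be computed from two rows of a pairwise alignment. The function name and the choice to count gap columns in the denominator are assumptions made for this illustration; BLAST reports identities over the aligned region in a similar spirit.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical residues ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")  # ignore columns that are gaps in both rows
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# One mismatch in 20 aligned columns.
print(percent_identity("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA"))
```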

PSI-BLAST Position-Specific Iterative BLAST

starts with a regular blastp search (protein query vs protein database)
uses the best hits, those with an E-value below a threshold (e.g. 10^-5), to create an alignment and compute a protein profile
uses the protein profile to perform the next iterations of the search, adding new best hits to the profile
very good at picking up distant relationships between proteins
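
The idea of a protein profile can be sketched as per-column residue frequencies computed from the aligned hits. This is a deliberately simplified illustration: real PSI-BLAST additionally weights sequences, adds pseudocounts and converts frequencies to log-odds scores. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """For each alignment column, return {residue: relative frequency}, ignoring gaps."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        if total:
            profile.append({res: n / total for res, n in counts.items()})
        else:
            profile.append({})  # column made of gaps only
    return profile


# Toy alignment of a query (first row) with two hits.
alignment = ["MEGG", "MDGA", "MEG-"]
for position, frequencies in enumerate(column_frequencies(alignment), start=1):
    print(position, frequencies)
```

A conserved column (here position 1, always M) gets a sharply peaked distribution, while variable columns tolerate several residues. This position-specific information is what lets later iterations pick up more distant relatives than a single query sequence would.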

>X.trop_syngr1_NP_001016195.1
MEGGAYGAGKAGGAFDPQTFVRQPHTVLRMVSWVFSIVVFGCIINEGYINASTEAEEHCIFNRNSSACAY
GVTVGVLAFLTCLLYLAVDVYFPQISSVKDRKKTVISDIAVSGLWAFFWFVGFCFLANQWQVSNPNDNPM
NEGADAARAAIAFSFFSIFTWAGQAVLAYQRYRLGSDSALFSQDYMDPSQDQGPPYPPYASNEDLDPSAG
YQQPPSDAYDAGSQGYQTQDY

run search with regular blastp, observe hits
repeat the search with PSI-BLAST, run 3 iterations, note novel hits

Command line BLAST

First check whether $BLASTDB and $BLASTMAT are set (the command should display these names, without the $ signs, together with their paths):

env | grep BLAST

If the output is blank and these are not set in your '~/.ncbirc' file, set them (tcsh):

setenv BLASTDB /path/to/directory/with/blast/databases/
setenv BLASTMAT /path/to/your/directory/blast/data

or follow instructions for creating '~/.ncbirc' from here: [1]

to create a database out of your fasta file:

formatdb -i your_multiple_protein_file.fa -p T -V
formatdb -i your_multiple_DNAseq_file.fa  -p F -V

For more options type 'formatdb -h'

- nucleotide search

blastall -p blastn -d ecoli_nn -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastn_out

- protein search

blastall -p blastp -d ecoli_aa -i my_aa_fasta_file.fa -o my_aa_fasta_file_vs_ecolidb.blastp_out
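
When many searches have to be run, the command lines above can be assembled from a script. The sketch below only builds and prints the command, using the same blastall options shown above (-p, -d, -i, -o). The helper name blastall_command is hypothetical, and actually running the command assumes a local legacy BLAST installation with the databases in place.

```python
import shlex


def blastall_command(program, database, query, output):
    """Build the argument list for a legacy blastall run."""
    return ["blastall", "-p", program, "-d", database, "-i", query, "-o", output]


cmd = blastall_command(
    "blastn",
    "ecoli_nn",
    "my_nn_fasta_file.fa",
    "my_nn_fasta_file_vs_ecolidb.blastn_out",
)
print(shlex.join(cmd))

# To execute the search (requires blastall on your PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```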

Advanced topics

Substitution Matrices

To score a protein query against a putative protein hit, BLAST uses so-called substitution matrices. These are calculated from multiple protein alignments by counting how often amino acids are paired with each other in the alignment columns. You can find more info @NCBI.

By default blastp uses BLOSUM62 (a matrix constructed from ungapped local alignment blocks in which sequences sharing 62% identity or more were clustered together). One can use BLOSUM45 for more divergent sequences and BLOSUM80 for more closely related ones.
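
Entries of such a matrix are log-odds scores: the logarithm of how often a residue pair is observed in alignments divided by how often it would be expected by chance. The sketch below assumes scores in half-bit units (scale factor 2 with log base 2), as used for BLOSUM62. The frequencies are illustrative values chosen for this example, not taken from the actual BLOSUM data.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected)."""
    return round(scale * math.log2(observed / expected))


# Illustrative numbers (assumptions): a residue with background frequency 0.074
# is observed aligned with itself in a fraction 0.0215 of all aligned pairs.
background = 0.074
print(log_odds_score(0.0215, background * background))  # positive: seen more often than chance

# A pair observed four times less often than expected by chance scores negative.
print(log_odds_score(0.001, 0.004))
```

Positive scores mark substitutions that are favoured in related proteins, negative scores mark substitutions that are rarer than chance.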

Word size

BLAST uses 'words' (chunks of a given length) as initial matches between the query sequence and a putative match in the database. There is a trade-off between sensitivity and speed:

the shorter the word size, the more sensitive the search
the longer the word size, the faster the search

Typical word sizes:

blastp: 3 (default), 2
blastn: 11 (default)
megaBLAST: 28 (default), up to 64

More @NCBI
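
The effect of the word size can be illustrated with a small sketch that lists the words of a query and finds exact word matches in a subject sequence. This is a simplification under stated assumptions: nucleotide BLAST seeds on exact word matches, whereas blastp also accepts similar ('neighbourhood') words scoring above a threshold. The function names and toy sequences are hypothetical.

```python
def words(sequence, word_size):
    """Return all overlapping words of the given size."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


def shared_word_positions(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where identical words occur."""
    index = {}
    for pos, word in enumerate(words(subject, word_size)):
        index.setdefault(word, []).append(pos)
    hits = []
    for q_pos, word in enumerate(words(query, word_size)):
        for s_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


query = "ACGTTGCA"
subject = "ACGTAGCA"  # differs from the query at one position
for size in (3, 4, 8):
    print(size, shared_word_positions(query, subject, size))
```

With a single mismatch in the middle, short words still seed the alignment on both sides of the mismatch, while a word as long as the whole sequence finds no seed at all. Fewer seeds mean less work to extend, hence the gain in speed and the loss in sensitivity.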

Filters and Masking

Input query sequence may contain:

repetitive sequences
- simple, e.g. CACACA, proline-rich etc. (filtered by default by DUST/SEG)
- complex, e.g. Alu/LINE1 sequences (filtered by using "Species-specific repeats" and picking the correct organism, if available)
- CAVEAT: transmembrane and coiled-coil regions are not masked by NCBI BLAST tools and can therefore give spurious, non-homologous hits. You have to detect and mask them yourself.

custom filtering (lower case masking)

By providing a query sequence with the regions to be ignored in lower case, e.g. a genomic sequence with exons in capitals and introns in small letters, we can search with the exonic sequences only.
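
The lower-case convention can be illustrated with a short sketch that reports the upper-case (searchable) segments of a query and shows what the sequence looks like with the lower-case part masked out. The function names and the toy sequence are hypothetical; BLAST applies this masking itself when lower-case filtering is switched on.

```python
import re


def uppercase_segments(sequence):
    """Return (start, end, segment) for each run of upper-case letters (0-based, end exclusive)."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"[A-Z]+", sequence)]


def mask_lowercase(sequence, mask_char="N"):
    """Replace every lower-case letter with the mask character."""
    return "".join(mask_char if c.islower() else c for c in sequence)


# Toy genomic fragment: two exons (capitals) separated by an intron (small letters).
genomic = "ATGGCCgtaagtcagTTTGGA"
print(uppercase_segments(genomic))
print(mask_lowercase(genomic))
```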

BLAST implementations

Read:

Cha, I.E., and E.C. Rouchka. “Comparison of Current BLAST Software on Nucleotide Sequences.” In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 8 pp., 2005.

WU-BLAST

An often faster implementation of the BLAST algorithm from Washington University.

http://blast.wustl.edu/

MPI-BLAST

BLAST implementation suitable for clusters

http://www.mpiblast.org/

External tutorials/courses
