Wikiomics:BLAST tutorial: Difference between revisions
Darek Kedra (talk | contribs) m (→tblastx) |
Darek Kedra (talk | contribs) m (file link) |
||
Line 1: | Line 1: | ||
[http://www.ncbi.nlm.nih.gov/BLAST Blast] most popular DNA/protein sequence search algorithm tool. See also [[Wikipedia:BLAST]]. | [http://www.ncbi.nlm.nih.gov/BLAST Blast] most popular DNA/protein sequence search algorithm tool. See also [[Wikipedia:BLAST]]. | ||
Download blast searches results from [http://openwetware.org/wiki/Image:Blast_course_080208.zip here] | |||
=BLAST input= | =BLAST input= |
Revision as of 16:32, 7 February 2008
Blast most popular DNA/protein sequence search algorithm tool. See also Wikipedia:BLAST.
Download blast searches results from here
BLAST input
- sequence as FASTA
>MIMI_L4 complement(6238..7602) MPQKTSKSKSSRTRYIEDSDDETRGRSRNRSIEKSRSRSLDKSQKKSRDK SLTRSRSKSPEKSKSRSKSLTRSRSKSPKKCITGNRKNSKHTKKDNEYTT EESDEESDDESDGETNEESDEELDNKSDGESDEEISEESDEEISEESDED VPEEEYDDNDIRNIIIENINNEFARGKFGDFNVIIMKDNGFINATKLCKN
- accession number (web only)
proteins:
AAA68881.1
- gi number
129295 gi|129295
More about input formats is @NCBI
Databases
Nucleotide
- nr - All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (no EST, STS, GSS)
- month - All new GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days
- est - Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
- est_human - human EST sequences
- est_mouse - mouse EST sequences
and many more..
Protein
- nr - All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
- month - GenBank CDS translation+PDB+SP+PIR+PRF released in the last 30 days.
- swissprot - the last major release of the SWISS-PROT protein sequence database
+++
BLAST algorithms
regular BLAST
blastn
- nucleotide query vs nucleotide database
>DNA_seq1 CGGGAGGCGGCAGCGGCTGCAGCGTTGGTAGCATCAGCATCAGCATCAGCGGCAGCGGCAGCGGCCTCGG GCGGGGCCGGCCGGACGGACAGGCGGACAGAAGGCGCCAGGGGCGCGCGTCCCGCCCGGGCCGGCCATGG AGGGCGCCTCCTTCGGCGCGGGCCGCGCAGGGGCCGCCCTGGACCCCGTGAGCTTTGCGCGGCGGCCCCA
- select Somewhat similar sequences (blastn)
- use nr/nt database
blastp
- protein query vs protein DB
>MIMI_L4 complement(6238..7602) MPQKTSKSKSSRTRYIEDSDDETRGRSRNRSIEKSRSRSLDKSQKKSRDK SLTRSRSKSPEKSKSRSKSLTRSRSKSPKKCITGNRKNSKHTKKDNEYTT
- search nr protein database
blastx
- translated in 6 frames nucleotide query vs protein DB
>human_genomic_seq TGGACTCTGCTTCCCAGACAGTACCCCTGACAGTGACAGAACTGCCACTCTCCCCACCTG ACCCTGTTAGGAAGGTACAACCTATGAAGAAAAAGCCAGAATACAGGGGACATGTGAGCC ACAGACAACACAAGTGTGCACAACACCTCTGAGCTGAGCTTTTCTTGATTCAAGGGCTAG TGAGAACGCCCCGCCAGAGATTTACCTCTGGTCTTCTGAGGTTGAGGGCTCGTTCTCTCT TCCTGAATGTAAAGGTCAAGATGCTGGGCCTCAGTTTCCTCTTACATACTCACCAAAAGG CTCTCCTGATCAGAGAAGCAGGATGCTGCACTTGTCCTCCTGTCGATGCTCTTGGCTATG ACAAAATCTGAGCTTACCTTCTCTTGCCCACCTCTAAACCCCATAAGGGCTTCGTTCTGT GTCTCTTGAGAATGTCCCTATCTCCAACTCTGTCATACGGGGGAGAGCGAGTGGGAAGGA TCCAGGGCAGGGCTCAGACCCCGGCGCATGGACCTAGTCGGGGGCGCTGGCTCAGCCCCG CCCCGCGCGCCCCCGTCGCAGCCGACGCGCGCTCCCGGGAGGCGGCGGCAGAGGCAGCAT CCACAGCATCAGCAGCCTCAGCTTCATCCCCGGGCGGTCTCCGGCGGGGAAGGCCGGTGG GACAAACGGACAGAAGGCAAAGTGCCCGCAATGGAGGGAGCATCCTTTGGCGCGGGCCGT GCGGGAGCTGCCTTTGATCCCGTGAGCTTTGCGCGGCGGCCCCAGACCCTGTTGCGGGTC GTGTCCTGG
- search human proteins database:
- BLAST@NCBI ->
- Human ->
- select:
- Database: RefSeq protein
- Program: BLASTX
tblastn
- protein query vs translated in 6 frames nucleotide database
>Q9BDJ6|GHRL_BOVIN Appetite-reg.hormone precursor MPAPWTICSLLLLSVLCMDLAMAGSSFLSPEHQKLQRKEAKKPSGRLKPRTLEGQFDPEV GSQAEGAEDELEIRFNAPFNIGIKLAGAQSLQHGQTLGKFLQDILWEEAEETLANE
- search non-human, non-mouse ESTs
tblastx
- translated in 6 frames nucleotide query vs translated in 6 frames nucleotide database
- nowadays seldom used except ESTs/genomic sequences of more exotic genomes
>gi|50539273|gb|CO636075.1 Gregarina niphandrodes GCCATTACGGCCGGGGGAAAGAAGTACTACCAACACAATGGTGGCTGATGACAAGATAATCACCACGAAG GTTGTTTCTAAGAAGGTGGTTTCCACCAAGGTGGTCACCGGCGGCGACAAGAAGAAGAAGCATGTTGTTG CCGAATAATTGATCTACATTATTAACTTGTTCATCAACATCCTCAACGACCTCATCATCATCTCCTACAC GATGATCATCGTACACAACTTCTTACTCGACAACTTCTCCTCAATCTTCTATGATGCCACCAACCTCTGC TGCCGCACCACCACTAATACAATTGTTGCTCGTCAAATCCTCAACGCCGTCCGTCTTATCATCCTCGCCG AACTGTTGACCCACGCTACCACCAACGGGGCCACCGCCGCCACCCACTACTACAGACCCGCCTACCAACG TTGTACATACATATTCTACATCATCAACGACTGATGCCACCGAGCTTCTTTCCACGCCTGATCTACTTGC CCACCCAACAACCCCCCTAAAACAAAATCTGTGTCCCCCGTCTCGGCCCGCGTATCCAATATTTCTAGAG CCGCCCCCACCGGTGCGGATCCCCCTCTTGTGCCCCCCTTCACGGGGGTTTATTCCCGCCCGTGGGTGTA CCACAGTGCCCCACACCTGGTCCCCTGGTGCACAAACGTGCTTACCCCCTCTACAC
- try blastx first (no sensible hits) then
- tblastx vith non-human, non-mouse ESTs (some hits to other ESTs)
megaBLAST
- very fast search for highly similar DNA sequences (95% or more )
- good for:
- cDNA vs genome (same or very close species)
- splice variants of the same gene
PSI-BLAST Position-Specific Iterative BLAST
- starts with a regular blastp search (protein query vs protein database)
- uses best hits above a threshold (E -10^5) to create alignment and compute protein profile
- uses protein profile to perform next iterations of search, adding new best hits to the profile
- very good at picking distant relationships between proteins
>X.trop_syngr1_NP_001016195.1 MEGGAYGAGKAGGAFDPQTFVRQPHTVLRMVSWVFSIVVFGCIINEGYINASTEAEEHCIFNRNSSACAY GVTVGVLAFLTCLLYLAVDVYFPQISSVKDRKKTVISDIAVSGLWAFFWFVGFCFLANQWQVSNPNDNPM NEGADAARAAIAFSFFSIFTWAGQAVLAYQRYRLGSDSALFSQDYMDPSQDQGPPYPPYASNEDLDPSAG YQQPPSDAYDAGSQGYQTQDY
- run search with regular blastp, observe hits
- repeat serch with PSI-BLAST, run 3 iterations, note novel hits
Command line BLAST
check first if $BLASTDB and $BLASTMAT are set (it should display these names sans $ signs & pathways):
env | grep BLAST
If you get a blank and these are not set in your '~/.ncbirc' file (tcsh):
setenv $BLASTDB /path/to/directory/with/blast/databases/ setenv $BLASTMAT /path/to/your/director/blast/data
or follow instructions for creating '~/.ncbirc' from here: [1]
to create a database out of your fasta file:
formatdb -i your_multiple_protein_file.fa -p T -V formatdb -i your_multiple_DNAseq_file.fa -p F -V
For more options type 'formatdb -h'
- nucleotide search
blastall -p blastn -d ecoli_nn -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastn_out
- protein search
blastall -p blastp -d ecoli_aa -i my_nn_fasta_file.fa -o my_nn_fasta_file_vs_ecolidb.blastp_out
Advanced topics
Substitution Matrices
For calculating scores of protein query vs putative protein hit BLAST uses so called substitution matrices. These are calculated from multiple protein alignments by counting amino acid frequences in a given columns. You can find more info @NCBI.
By default blastp uses BLOSUM62 ( a matrix constructed from local alignment of proteins with similarity 62% and better). One can use BLOSUM45 for more divergent sequences and BLOSUM80 for closer related ones.
Word size
BLAST uses 'chunks' of a given length as an initial matches between query sequence and putative match in database. There is a trade off (sensitivity vs speed):
- shorter the word size, more sensitive the search.
- longer word sizes, faster the search
Typical word sizes:
- blastp: 3 (default), 2
- blastn: 28 (default), up to to 64
More @NCBI
Filters and Masking
Input query sequence may contain:
- repetitive sequences
- simple i.e CACACA, proline-rich etc. (filtered by default by DUST/SEG)
- complex i.e. Alu/LINE1 sequences (filtered by using "Species-specific repeats" and picking the correct organism, if available)
- CAVEAT: transmembrane and coiled-coil regions are not masked by NCBI BLAST tools and therefore will give a spurious, non-homologous hits. You have to detect them and mask them yourself.
- custom filterin )lower case masking)
By providing query sequence i.e. genomic sequence with EXONS as capitals and introns in small letters we can search exonic sequences only.
BLAST implementations
Read:
- Cha, I.E., and E.C. Rouchka. “Comparison of Current BLAST Software on Nucleotide Sequences.” In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 8 pp., 2005.
WU-BLAST
Often faster implementation of BLAST algorithm from Washington University.
MPI-BLAST
BLAST implementation suitable for clusters