User:R. Eric Collins/MBL/BLAST

Bill Pearson (wrp@virginia.edu), BLAST lab

Homology/Orthology
- Homology is a binary trait: it is or it isn't
- Orthology, in frequent use: "a homolog that does the same thing"
- Orthology, actual definition: "proteins that diverged by speciation"

BLAST
- search for proteins
- use the default options right out of the box! (mostly)
- short reads cause problems; parameters may have to be changed for them
- "I'll never publish 1000 papers, so I choose E() < 1e-3 so I'll probably never be wrong"
- E < 1e-3 really is good/statistically significant
  - this holds for a single search; for metagenomics, correct for the many searches performed, but do not overcorrect
  - one standard definition of homology: 30% identical over the entire length (but this is not a very good criterion)
- 23% vs 26% identity: "it's the good identities"
- nonidentities can give positive scores when they are conservative replacements (functionally similar residues)
- on the edge of homology? "I'm a big fan of doing another search"
  - then infer homology by transitivity
- "I would use 10^-10 for DNA sequences"
- searching smaller databases is faster and more sensitive
- "I feel sorry for you if you don't have protein sequences" in reference to structural RNAs etc.
- Which database to search? the smallest possible
- P(S') is the probability of getting a score of at least S' by chance when two sequences are compared; range {0..1}
- E(S') = P(S') × D is the expectation value, which scales with the size of the database; range {0..D}, where D is the size of the database
- Scoring matrices
  - PAM based on evolutionary model, extrapolated from similar to distant changes
    - small numbers == small changes (large similarity)
  - BLOSUM empirically determined across range of distances
    - small numbers == small similarity (large changes)
  - use a shallower scoring matrix (e.g. PAM40 rather than PAM250) to extract the maximum amount of information from short reads
    - shallow matrices limit the maximum evolutionary look-back time
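The point about "good identities" and conservative replacements can be sketched with a few entries from the standard BLOSUM62 matrix. This is only an illustration: the peptide strings and the helper names `pair_score`, `ungapped_score`, and `percent_identity` are made up for this example, and only a small subset of the matrix is included.

```python
# A handful of BLOSUM62 entries (the matrix is symmetric, so each pair is stored once).
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("V", "V"): 4, ("L", "L"): 4,
    ("D", "D"): 6, ("E", "E"): 5, ("K", "K"): 5, ("R", "R"): 5,
    ("I", "V"): 3, ("I", "L"): 2, ("D", "E"): 2, ("K", "R"): 2,  # conservative
    ("L", "D"): -4, ("I", "D"): -3, ("K", "D"): -1,              # non-conservative
}


def pair_score(a, b):
    """Look up one residue pair, trying both orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    if (b, a) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in this illustrative subset")


def ungapped_score(s1, s2):
    """Sum of substitution scores for two equal-length, ungapped sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(s1, s2))


def percent_identity(s1, s2):
    """Percentage of aligned positions with identical residues."""
    matches = sum(a == b for a, b in zip(s1, s2))
    return 100.0 * matches / len(s1)


# Hypothetical peptides: both "hits" share 0% identity with the query.
query = "IDKL"
conservative_hit = "VERI"      # every position is a conservative replacement
nonconservative_hit = "DLDD"   # every position is a non-conservative replacement

for hit in (conservative_hit, nonconservative_hit):
    print(hit, percent_identity(query, hit), ungapped_score(query, hit))
```

Both hits have the same percent identity, yet their similarity scores differ sharply, which is why percent identity alone is a weak criterion for homology.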

- PSI-BLAST/CS-BLAST
  - http://en.wikipedia.org/wiki/CS-BLAST
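Returning to the E-value notes above, the relation E(S') = P(S') × D can be made concrete with a little arithmetic. This is a rough sketch under stated assumptions: D is treated as a number of database sequences, all the numbers are hypothetical, and the many-searches calculation is a simple Bonferroni-style product, not a method prescribed in the lecture.

```python
def expectation(p_value, db_size):
    """E(S') = P(S') * D: expected number of chance hits scoring at least S'."""
    return p_value * db_size


def expected_false_positives(e_threshold, n_searches):
    """Expected number of chance hits accepted over many independent searches."""
    return e_threshold * n_searches


# Hypothetical numbers, chosen only for illustration.
P_SCORE = 1e-9           # P(S') for a single pairwise comparison
SMALL_DB = 35_000        # assumed size of a small, targeted database
LARGE_DB = 50_000_000    # assumed size of a large, comprehensive database
THRESHOLD = 1e-3         # the E() cutoff quoted in the notes

e_small = expectation(P_SCORE, SMALL_DB)
e_large = expectation(P_SCORE, LARGE_DB)

print(f"small database: E = {e_small:.1e}, significant: {e_small < THRESHOLD}")
print(f"large database: E = {e_large:.1e}, significant: {e_large < THRESHOLD}")

# With E() < 1e-3 per search, about one chance hit is expected per 1000 searches.
print(expected_false_positives(THRESHOLD, 1000))
```

The same alignment score passes the cutoff against the small database but not against the large one, which is the sense in which smaller databases are more sensitive. The last line shows why a cutoff that is safe for a single search needs adjusting when thousands of searches are run.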

Question: when expecting rare or distantly related sequences, such as those from a virome, what is the best strategy for choosing the scoring matrix and database size?

Command Line BLAST Exercise

wget ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_pep_20101214
formatdb -i TAIR10_pep_20101214

blastp -query gstt1_drome.aa -db TAIR10_pep_20101214 > gstt1_drome.blastp.at_out
blastx -query 8033.nt -db TAIR10_pep_20101214

wget ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_seq_20101214
formatdb -i TAIR10_seq_20101214 -p F

tblastn -query gstt1_drome.aa -db TAIR10_seq_20101214 > 
blastn -query 8033.nt -db TAIR10_seq_20101214 > 8033.blastn.at_out

blastdbcmd -entry calm_human > calm_human.aa
lalign36 calm_human.aa calm_human.aa
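To apply the E() < 1e-3 cutoff from the notes to search results, the hits can be filtered after the fact. The sketch below assumes the search was rerun with the BLAST+ tabular option `-outfmt 6` (not used in the exercise above), whose default columns are listed in `OUTFMT6_FIELDS`. The function name `significant_hits` and the sample rows are invented for illustration and are not real search results.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def significant_hits(handle, max_evalue=1e-3):
    """Return the rows of a tabular BLAST report whose E-value is below max_evalue."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(OUTFMT6_FIELDS, row))
        if float(record["evalue"]) < max_evalue:
            hits.append(record)
    return hits


# Made-up rows in -outfmt 6 layout, for demonstration only.
SAMPLE = (
    "query1\tsubject_A\t45.2\t210\t110\t3\t5\t212\t8\t215\t2e-45\t160\n"
    "query1\tsubject_B\t27.1\t180\t120\t6\t20\t195\t30\t205\t5e-04\t42.0\n"
    "query1\tsubject_C\t24.0\t75\t52\t2\t100\t172\t11\t84\t0.8\t28.5\n"
)

for hit in significant_hits(io.StringIO(SAMPLE)):
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```

For a real report, `io.StringIO(SAMPLE)` would be replaced by an open file handle, and a stricter `max_evalue` (for example 1e-10) could be passed for DNA searches.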
