User:R. Eric Collins/MBL/BLAST

From OpenWetWare

Bill Pearson BLAST lab

  • Homology/Orthology
    • Homology is a binary trait: it is or it isn't
    • orthology in frequent use: "a homolog that does the same thing"
    • orthology, actual definition: "proteins that diverged by speciation"
    • search for proteins
    • use the default options right out of the box (mostly)!
    • short reads cause problems; may have to change parameters
    • "I'll never publish 1000 papers, so I choose E() < 1e-3 so I'll probably never be wrong"
    • E < 1e-3 really is good/statistically significant
      • for a single search... if doing metagenomics need to correct for many searches, but not overcorrect...
      • one standard definition of homology: 30% identical over entire length (but this is not very good)
    • 23% vs 26% identity: "it's the good identities"
    • nonidentities give positive scores because they are conservative replacements (functionally similar)
    • on the edge of homology? "I'm a big fan of doing another search"
      • then infer homology by transitivity
    • "I would use 10^-10 for DNA sequences"
    • searching smaller databases is faster and more sensitive
    • "I feel sorry for you if you don't have protein sequences" in reference to structural RNAs etc.
    • Which database to search? the smallest possible
    • P(S') is the probability of getting a score of S' (or better) when two sequences are compared; range 0..1
    • E(S') = P(S') × D is the expectation value, which scales with the size of the database; range 0..D, where D is the size of the database
    • Scoring matrices
      • PAM based on evolutionary model, extrapolated from similar to distant changes
        • small numbers == small changes (large similarity)
      • BLOSUM empirically determined across a range of distances
        • small numbers == small similarity (large changes)
      • use a shallower scoring matrix (e.g. PAM40 rather than PAM250) to extract the maximum amount of information from short reads
        • shallow matrices set maximum look-back time
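
The relation E(S') = P(S') × D from the notes can be sketched as a small calculation. This is only an illustration: the helper names `evalue` and `significant` and the probability and database sizes are made up for the example, not taken from the lab.

```shell
# Hypothetical helper: E(S') = P(S') * D, the expected number of chance
# hits in a database of D sequences.
evalue() {
    awk -v p="$1" -v d="$2" 'BEGIN { printf "%.3g\n", p * d }'
}

# Hypothetical helper: succeed when an E-value is below the cutoff
# (default: the E() < 1e-3 threshold quoted in the notes).
significant() {
    awk -v e="$1" -v max="${2:-1e-3}" 'BEGIN { exit !((e + 0) < (max + 0)) }'
}

# The same per-comparison probability against a small and a large database.
small=$(evalue 1e-9 10000)
large=$(evalue 1e-9 10000000)
echo "D=10000:    E=$small"
echo "D=10000000: E=$large"
```

With these made-up numbers the hit passes the 1e-3 cutoff against the smaller database but not against the larger one, which is the arithmetic behind the advice to search the smallest possible database.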

Question: when expecting rare or distantly related sequences, such as those from a virome, what is the best strategy for choosing the scoring matrix and database size?

Command Line BLAST Exercise

formatdb -i TAIR10_pep_20101214
blastp -query gstt1_drome.aa -db TAIR10_pep_20101214 > gstt1_drome.blastp.at_out
blastx -query 8033.nt -db TAIR10_pep_20101214
formatdb -i TAIR10_seq_20101214 -p F
tblastn -query gstt1_drome.aa -db TAIR10_seq_20101214 > 
blastn -query 8033.nt -db TAIR10_seq_20101214 > 8033.blastn.at_out

blastdbcmd -entry calm_human > calm_human.aa
lalign36 calm_human.aa calm_human.aa
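
To apply the E() < 1e-3 rule from the notes to search output, one option is to filter tabular results. The sketch below assumes the search was run with the BLAST+ tabular option `-outfmt 6`, where the E-value is the 11th tab-separated column. The exercise above uses the default report format, so this is an assumption. `filter_hits` and `demo_rows` are hypothetical names, and the two rows are invented for illustration.

```shell
# Keep tabular BLAST hits (-outfmt 6) whose E-value (column 11) is below
# a cutoff. Hypothetical helper; the default cutoff is 1e-3.
filter_hits() {
    awk -F '\t' -v max="${1:-1e-3}" '($11 + 0) < (max + 0)'
}

# Two made-up rows, for illustration only (not real search results).
demo_rows() {
    printf 'query1\tsubjA\t45.0\t200\t110\t0\t1\t200\t1\t200\t1e-50\t180\n'
    printf 'query1\tsubjB\t24.0\t90\t68\t1\t5\t94\t10\t99\t0.5\t30.1\n'
}

# Only the row with the E-value below the cutoff is printed.
demo_rows | filter_hits 1e-3
```

A stricter cutoff can be passed as the first argument, for example `filter_hits 1e-10` for DNA searches as suggested in the notes.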