User:R. Eric Collins/MBL/BLAST
From OpenWetWare
Bill Pearson (wrp@virginia.edu), BLAST lab
- Homology/Orthology
- Homology is a binary trait: it is or it isn't
- orthology, in frequent use: "a homolog that does the same thing"
- orthology, actual definition: "proteins that diverged by speciation"
- BLAST
- search for proteins
- use options right out of the box! mostly
- lack of long reads causes problems, may have to change parameters
- "I'll never publish 1000 papers, so I choose E() < 1e-3 so I'll probably never be wrong"
- E < 1e-3 really is good/statistically significant
- for a single search... if doing metagenomics need to correct for many searches, but not overcorrect...
- one standard definition of homology: 30% identical over entire length (but this is not very good)
- 23% vs 26% identity: "it's the good identities"
- nonidentities give positive scores because they are conservative replacements (functionally similar)
- on the edge of homology? "I'm a big fan of doing another search"
- then infer homology by transitivity
- "I would use 10^-10 for dna sequences"
- searching smaller databases is faster and more sensitive
- "I feel sorry for you if you don't have protein sequences" in reference to structural RNAs etc.
- Which database to search? the smallest possible
- P(S') is the probability of getting score S' when two sequences are compared, range {0..1}
- E(S') = P(S') × D is the expectation value, which scales with the size of the database, range {0..D}, where D is the size of the database
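The relation E(S') = P(S') × D and the remark about correcting for many searches can be illustrated with a minimal Python sketch. The probability, the database sizes, and the function names are made up for illustration, and the Bonferroni-style division is only one simple way to correct for multiple searches (it is conservative, so it may overcorrect).

```python
def expectation(p_value: float, db_size: int) -> float:
    """E(S') = P(S') * D: expected number of chance scores >= S' in D comparisons."""
    return p_value * db_size


def per_search_cutoff(e_cutoff: float, n_searches: int) -> float:
    """Bonferroni-style correction (an assumed choice): split the cutoff across searches."""
    return e_cutoff / n_searches


p = 1e-8  # hypothetical P(S') for one pairwise comparison
for d in (10_000, 1_000_000, 100_000_000):  # hypothetical database sizes
    e = expectation(p, d)
    verdict = "significant" if e < 1e-3 else "not significant"
    print(f"D={d:>11,}  E={e:.0e}  {verdict} at E < 1e-3")

# Cutoff to apply to each search when running 1000 searches at an overall E < 1e-3
print(per_search_cutoff(1e-3, 1000))
```

The same pairwise score passes E < 1e-3 only against the smallest database here, which is the sense in which a smaller database is more sensitive.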
- Scoring matrices
- PAM based on evolutionary model, extrapolated from similar to distant changes
- small numbers == small changes (large similarity)
- BLOSUM empirically determined across range of distances
- small numbers == small similarity (large changes)
- use a shallower scoring matrix to extract the maximum amount of info from short reads, e.g. PAM40 rather than PAM250
- shallow matrices set maximum look-back time
- PSI-BLAST/CS-BLAST
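The idea that PAM matrices are extrapolated from similar to distant sequences can be sketched with a toy model. The example below assumes a hypothetical three-letter alphabet (A, B, C) with made-up one-step mutation probabilities, raises that matrix to the 40th and 250th power, and converts the result to log-odds scores in bits. It is not the real 20×20 PAM data, only an illustration of the extrapolation step.

```python
import math


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, n):
    """Apply the one-step mutation matrix n times (the extrapolation step)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, m)
    return result


def log_odds(m, background):
    """Score in bits: log2(P(i -> j) / background frequency of j)."""
    return [[math.log2(m[i][j] / background[j]) for j in range(len(m))]
            for i in range(len(m))]


# Hypothetical one-step probabilities: A<->B is a frequent (conservative)
# replacement, changes to or from C are rare. Each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform residue frequencies

for steps in (40, 250):
    scores = log_odds(matrix_power(TOY_PAM1, steps), BACKGROUND)
    print(f"{steps:>3} steps: A/A = {scores[0][0]:+.2f} bits, "
          f"A/B = {scores[0][1]:+.2f} bits")
```

In this toy model the shallow (40-step) matrix gives identities a higher score and scores the A/B replacement negatively, while the deep (250-step) matrix gives identities a lower score and the frequent A/B replacement a slightly positive one.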
Questions: when expecting rare or distantly related sequences, such as those from a virome, what is the best strategy for choosing the scoring matrix and database size?
wget ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_pep_20101214
formatdb -i TAIR10_pep_20101214
blastp+ -query gstt1_drome.aa -db TAIR10_pep_20101214 > gstt1_drome.blastp.at_out
blastx+ -query 8033.nt -db TAIR10_pep_20101214
wget ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_seq_20101214
formatdb -i TAIR10_seq_20101214 -p F
tblastn -query gstt1_drome.aa -db TAIR10_seq_20101214 >
blastn+ -query 8033.nt -db TAIR10_seq_20101214 > 8033.blastn.at_out
blastdbcmd -entry calm_human > calm_human.aa
lalign36 calm_human.aa calm_human.aa
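To apply the E() < 1e-3 rule to search results, one option is to filter tabular output. The sketch below assumes the search was rerun with the BLAST+ option `-outfmt 6` (not used in the commands above), which writes twelve tab-separated columns per hit. The function name and the sample rows are made up for illustration.

```python
# Default column order of BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def significant_hits(lines, e_cutoff=1e-3):
    """Return hits (as dicts) whose E-value is below e_cutoff."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, line.split("\t")))
        if float(record["evalue"]) < e_cutoff:
            hits.append(record)
    return hits


# Made-up rows, for illustration only (not real search results)
EXAMPLE_ROWS = [
    ["query_1", "subject_A", "45.20", "210", "110", "3", "1", "208", "5", "212", "2e-30", "120"],
    ["query_1", "subject_B", "27.50", "80", "55", "2", "30", "108", "12", "90", "0.5", "28.1"],
    ["query_1", "subject_C", "31.00", "150", "98", "4", "10", "157", "20", "168", "4e-05", "40.0"],
]
EXAMPLE_LINES = ["\t".join(row) for row in EXAMPLE_ROWS]

for hit in significant_hits(EXAMPLE_LINES):
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```

With a real output file, the lines could be read with `open(path)` and passed to `significant_hits` directly. A stricter cutoff, such as the 1e-10 suggested for DNA searches, can be passed as `e_cutoff`.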