HMM Profile Building Notes

5/2/2009

I've picked 102 gene families with a universality of 100 and built 102 HMM profiles.

Here is the protocol:

1. get all the peptides for each family
2. use kalign to align the proteins, and FastTree to build a tree
3. use maxPD.pl to rank the genes by their phylogenetic contribution, and manually decide which genes to keep
according to their PD contribution
4. 5 gene families (F11191-11194 and F11199) contain too many genes. For these I use MCL clustering to
divide the genes into fewer than 250 clusters and pick one gene from each cluster as a representative. Step 3 is
then applied to these families using only the representatives.
5. use MUSCLE to make good alignments of all the selected sequences, and use zorro to trim the alignments (cutoff 1)
6. use hmmbuild (-g option) to build HMM profiles for all the families
7. search each HMM profile against its corresponding original peptide file, and count how many profiles leave genes
out at an E-value cutoff of 0.01 (30 of them did)
8. add the left-out sequences back to the seeds after a redundancy check, rebuild the HMM profiles, and this time
run hmmcalibrate on the models
9. search each HMM profile against the NRAA database, and decide the trusted cutoffs (the bit score when E=0.001)
and noise cutoffs (the bit score when E=0.05)
10. annotate the HMM profiles (function assignments), and add the information to the 4 fields ACC (accession),
DESC (description), TC (trusted cutoff) and NC (noise cutoff).
11. keep the following: pepfile, seedfile, original_alignments, zorro_mask, trimmed_alignments and the hmm
profiles for future distribution.
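
The cutoff choice in step 9 can be sketched in a few lines of Python. This is only an illustration, not the script
that was actually used: it assumes the search hits have already been parsed into (bit score, E-value) pairs, and it
reads "the bit score when E=0.001" as the lowest bit score among hits whose E-value is at or below that threshold.
The function name score_cutoff and the example hits are made up.

```python
def score_cutoff(hits, evalue_threshold):
    """Return the lowest bit score among hits with E-value <= threshold.

    hits: iterable of (bit_score, evalue) pairs from one profile search.
    Returns None when no hit passes the threshold.
    """
    passing = [score for score, evalue in hits if evalue <= evalue_threshold]
    return min(passing) if passing else None


# Made-up hits for illustration only: (bit score, E-value).
example_hits = [
    (250.3, 1e-70),
    (80.1, 2e-20),
    (31.5, 8e-4),
    (24.2, 0.03),
    (18.9, 0.4),
]

trusted_cutoff = score_cutoff(example_hits, 0.001)  # candidate TC value
noise_cutoff = score_cutoff(example_hits, 0.05)     # candidate NC value
print(trusted_cutoff, noise_cutoff)  # prints: 31.5 24.2
```

With real search output the two values would still need a manual sanity check before going into the TC and NC fields.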

I will do the same for the 100 families with the highest evenness numbers, but will standardize and automate the
process first.
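
As a starting point for the automation, the HMMER part of the protocol (steps 6-8) could be generated per family
along these lines. This is a rough sketch under several assumptions: the HMMER 2 command-line tools (the series that
includes hmmcalibrate), made-up file name suffixes (.trimmed.aln, .pep, .v2.aln, .v2.hmm), and the same -g option for
the rebuilt profile. The functions only build the command lines; they do not run anything.

```python
def first_pass_commands(prefix):
    """Steps 6-7: build a profile, then search it against the family's peptides."""
    hmm = prefix + ".hmm"
    return [
        ["hmmbuild", "-g", hmm, prefix + ".trimmed.aln"],
        ["hmmsearch", "-E", "0.01", hmm, prefix + ".pep"],
    ]


def rebuild_commands(prefix):
    """Step 8: rebuild from the extended seed alignment, then calibrate."""
    # A separate output name is used so the first-pass profile is not overwritten.
    hmm = prefix + ".v2.hmm"
    return [
        ["hmmbuild", "-g", hmm, prefix + ".v2.aln"],
        ["hmmcalibrate", hmm],
    ]


# Family IDs taken from the notes above; the file prefixes are hypothetical.
for family in ["F11191", "F11192"]:
    for cmd in first_pass_commands(family) + rebuild_commands(family):
        print(" ".join(cmd))
```

Each command list can be handed to subprocess.run(cmd, check=True) once the file naming is settled.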

5/20/2009
progress summary in the PDF file[1]
