HMM Profile Building Notes

From OpenWetWare

I picked 102 gene families with a universality of 100 and built 102 HMM profiles, one per family.

Here is the protocol:

1. get all the peptides for each family
2. use kalign to align the proteins, and FastTree to build a tree
3. sort the genes by their phylogenetic contribution, and manually decide which genes to keep according to
   PD contribution
4. In 5 gene families (F11191-11194 and F11199), there are too many genes in the family. I used MCL clustering to
   divide the genes into fewer than 250 clusters and picked one gene from each cluster as a representative. Step 3 was
   then followed for these families using only the representatives.
5. use MUSCLE to make good alignments for all the selected sequences, and use ZORRO to trim the alignments (cutoff 1)
6. use hmmbuild (-g option) to build HMM profiles for all the families
7. search each HMM profile against its corresponding original peptide file, and count how many profiles leave genes out
   at an E-value cutoff of 0.01 (30 of them)
8. add the left-out sequences back to the seeds after a redundancy check, and rebuild the HMM profiles, this time running
   hmmcalibrate on the models
9. search each HMM profile against the NRAA database, and decide the trusted cutoffs (the bit score at E = 0.001)
   and noise cutoffs (the bit score at E = 0.05)
10. annotate the HMM profiles (function assignments), and add information to the 4 fields ACC (accession),
    DESC (description), TC (trusted cutoff) and NC (noise cutoff)
11. keep the following: pepfile, seedfile, original_alignments, zorro_mask, trimmed_alignments and the HMM
    profiles for future distribution
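
The trimming in step 5 could be sketched in Python roughly as follows. This is only an illustration, not the script actually used: it assumes the ZORRO mask gives one confidence score per alignment column, and that "cutoff 1" means keeping columns whose score is at least 1. The function name `trim_alignment` and the toy data are hypothetical.

```python
def trim_alignment(alignment, column_scores, cutoff=1.0):
    """Keep only alignment columns whose mask score is >= cutoff.

    alignment: dict mapping sequence ID -> aligned sequence (all equal length).
    column_scores: one confidence score per alignment column.
    Assumption: "cutoff 1" means columns scoring below 1 are dropped.
    """
    n_cols = len(column_scores)
    for seq_id, seq in alignment.items():
        if len(seq) != n_cols:
            raise ValueError(
                f"{seq_id}: length {len(seq)} does not match {n_cols} mask columns"
            )
    # Indices of the columns that pass the cutoff
    keep = [i for i, score in enumerate(column_scores) if score >= cutoff]
    return {
        seq_id: "".join(seq[i] for i in keep)
        for seq_id, seq in alignment.items()
    }


# Toy example with made-up data: columns 1 and 3 score below the cutoff.
toy_alignment = {"geneA": "MK-LV", "geneB": "MRTL-"}
toy_scores = [9.5, 0.2, 4.0, 0.9, 1.0]
trimmed = trim_alignment(toy_alignment, toy_scores, cutoff=1.0)
```

Raising an error on a length mismatch is a deliberate choice here: a mask that does not line up with the alignment would otherwise silently trim the wrong columns.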
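
The cutoff choice in step 9 can be illustrated with a small Python sketch. It rests on one reading of the notes, which is an assumption: the trusted cutoff is taken as the lowest bit score among hits with an E-value of at most 0.001, and the noise cutoff as the lowest bit score among hits with an E-value of at most 0.05. The function names and the hit list are hypothetical, and parsing of the search output is left out.

```python
def score_cutoff(hits, max_evalue):
    """Return the lowest bit score among hits with E-value <= max_evalue.

    hits: iterable of (bit_score, evalue) pairs from one profile search.
    Returns None when no hit passes the threshold.
    """
    passing = [score for score, evalue in hits if evalue <= max_evalue]
    return min(passing) if passing else None


def trusted_and_noise_cutoffs(hits, trusted_e=0.001, noise_e=0.05):
    """Assumed reading of step 9: TC at E = 0.001, NC at E = 0.05."""
    hits = list(hits)  # allow generators, since the hits are scanned twice
    return score_cutoff(hits, trusted_e), score_cutoff(hits, noise_e)


# Made-up hits for illustration: (bit score, E-value)
toy_hits = [
    (250.3, 1e-70),
    (88.1, 2e-20),
    (31.4, 0.0008),
    (22.7, 0.03),
    (15.2, 0.4),
]
tc, nc = trusted_and_noise_cutoffs(toy_hits)
```

With these toy hits, `tc` is 31.4 and `nc` is 22.7; the last hit (E = 0.4) affects neither cutoff.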

I will do the same for the 100 families with the highest evenness values, but I will standardize and automate the process first.
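
As one example of what the automation might look like, the left-out check in steps 7 and 8 could be scripted along these lines. This is a hedged sketch: it assumes the search output has already been parsed into (sequence ID, E-value) pairs, and the function name `left_out_sequences` and the IDs below are hypothetical.

```python
def left_out_sequences(all_ids, hits, max_evalue=0.01):
    """Return IDs from the peptide file that the profile failed to recover.

    all_ids: iterable of sequence IDs in the family's original peptide file.
    hits: iterable of (sequence_id, evalue) pairs parsed from the search output.
    A sequence counts as recovered if any of its hits has E-value <= max_evalue.
    """
    found = {seq_id for seq_id, evalue in hits if evalue <= max_evalue}
    return sorted(set(all_ids) - found)


# Made-up example: g3 is hit only above the cutoff, g4 is not hit at all.
toy_ids = ["g1", "g2", "g3", "g4"]
toy_search_hits = [("g1", 1e-50), ("g2", 0.005), ("g3", 0.2)]
missing = left_out_sequences(toy_ids, toy_search_hits)
```

Here `missing` is `["g3", "g4"]`. A family with a non-empty result would be one of those whose seed needs the left-out sequences added back before rebuilding.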

A progress summary is in the PDF file [1].