Moore Notes 9 04 12

From OpenWetWare

ISEEM2 Group call for September 4, 2012

  • Participants:
    • Jonathan
    • Guillaume
    • Tom
    • Stephen
    • Josh
  • Stephen's discussion of his work
    • Conducted statistical simulation to see how well SFams recover metagenomic homologs
    • Two search strategies:
      • 1. Classify a set of sequences (a sample) into families across the database. When using the top hit, what are the accuracy and fidelity of classification? Not much work has been done on this subject.
      • 2. Classify a set of sequences into a particular family (e.g., transporters). In the past, profiles have been shown to detect remote homologs better than pairwise methods. Not using just the best hit, but anything that passes a particular evalue threshold.
    • Stephen has mostly focused on #1 as it's most relevant to our current work. Four major aims:
      • How well are sequences classified into source families for different sequencing technologies
      • What parameters optimize classification
      • How does PD influence classification
      • Compare pairwise (BLAST) versus profile (HMMER) methods
    • Developed a simulation pipeline. Randomly sampled 1000 SFams without replacement, then 100 sequences from each SFam with replacement. Simulates intergenic sequence using random nucleotides. Grinder simulates reads with parameters chosen to mimic various sequencing technologies (Illumina, 454, Sanger). Conducts leave-one-out analysis and then searches all reads against all families.
    • JAE: might need to optimize thresholds for families depending on their conservation. For TIGR, cutoffs are set family by family. Might be able to bin families by PD and set thresholds based on PD-bin. Might need to simulate per family.
    • For search strategy #1, he finds that the recall of classification shifts towards the right (1.0) as read length increases. Distributions are almost identical between BLAST and HMM (no sig. difference). The low recall for Illumina is due to reads not being classified. It is not due to misclassification.
    • For the same strategy, almost all families have perfect precision. Slightly less precise for HMMER, though it's complicated by evalue. Not much of a shift, but it might be significant.
    • What is the optimal threshold for classifying reads into families? Comparing evalue and coverage finds that evalue has a strong effect on recall whereas coverage tuning doesn't. For long reads, however, tuning evalue has less of an effect and tuning coverage has a much larger effect. Regarding precision, coverage is an important variable for full-length sequences. For simulated metagenomic sequences, this doesn't seem to be the case.
    • What is going on with coverage? Two answers:
      • (1) Intergenic sequence and
      • (2) Frameshift errors
      • Unless you can strip intergenic sequence
      • Do we need smarter annotation? Frameshifts and introns can create big problems for (meta)genomic annotation, but unclear how much benefit we'd get. So long as we get real results, it's less important that we miss part of the sequence.
    • Finds that BLAST detects remote homologs better than HMMER. Specifically, the True Classification Rate is higher at low PD for HMMER and higher at high PD for BLAST.
      • One possibility is that HMMs are not so good when a family's diversity is very high. Families with high PD may not do a great job at representing a sequence.
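The recall and precision results noted in Stephen's discussion hinge on the difference between reads that go unclassified and reads that are misclassified. The following is a minimal sketch of how per-family precision and recall could be tallied from top-hit assignments, not Stephen's actual pipeline. The function name and the input layout (pairs of source family and predicted family, with `None` for a read that had no hit passing the thresholds) are assumptions made for illustration.

```python
from collections import defaultdict


def per_family_precision_recall(assignments):
    """Tally per-family precision and recall.

    assignments: iterable of (source_family, predicted_family) pairs,
    where predicted_family is None for an unclassified read.
    (Hypothetical input layout, assumed for this sketch.)
    """
    n_source = defaultdict(int)     # reads simulated from each family
    n_predicted = defaultdict(int)  # reads assigned to each family
    n_correct = defaultdict(int)    # reads assigned to their source family

    for source, predicted in assignments:
        n_source[source] += 1
        if predicted is None:
            # Unclassified: lowers recall, leaves precision untouched.
            continue
        n_predicted[predicted] += 1
        if predicted == source:
            n_correct[source] += 1

    result = {}
    for fam in n_source:
        recall = n_correct[fam] / n_source[fam]
        # Precision is undefined when no read was assigned to the family.
        precision = (
            n_correct[fam] / n_predicted[fam] if n_predicted[fam] else None
        )
        result[fam] = (precision, recall)
    return result


# Toy data: family A loses half its reads to "no hit", none to other families.
toy = [("A", "A"), ("A", None), ("A", None), ("A", "A"),
       ("B", "B"), ("B", "B")]
scores = per_family_precision_recall(toy)
```

In the toy data, family A keeps perfect precision while its recall drops, which is the pattern described above for short reads: the loss comes from reads that are never classified, not from reads placed in the wrong family.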
  • Dongying
    • How can we group SFams according to their functions? Certain families may be part of biological pathways. They may also be subunits of big proteins.
    • Phylogenetic profiling can clue us in here. Grab completed genomes from IMG and mark whether each SFam is in each genome. Characterize each SFam by the presence/absence profile of its distribution across all genomes.
    • Can then conduct a co-occurrence correlation analysis. But, this is slightly naive as it doesn't take phylogeny into account. If you have 20 E. coli genomes and SFams co-occur in them, we could be overweighting the co-occurrence of these families: they may co-occur only because the genomes have not had enough time to diverge, not because the families share a biological function.
    • Independent contrasts can solve this problem. Build a phylogenetic tree. Instead of counting presence/absence in each tree tip, infer the state of the ancestral node based on its descendants. If all descending tips are present, the node value is set to present (or something similar). This allows us to subtract out the phylogenetic diversity from the calculation.
    • We then do clustering based on the independent contrast profiling analysis.
    • Want to find families that co-occur and are well distributed across the tree (e.g., through horizontal gene transfer).
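The phylogenetic profiling step described above can be sketched in a few lines. This is a minimal illustration of the naive version only (presence/absence profiles plus pairwise correlation); it does not apply the independent-contrasts correction, and it is not Dongying's code. The function names, the genome-to-family mapping, and the toy genome and family labels are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def presence_profiles(genome_to_families, families):
    """Build a 0/1 presence profile per family across sorted genomes."""
    genomes = sorted(genome_to_families)
    return {
        fam: [1 if fam in genome_to_families[g] else 0 for g in genomes]
        for fam in families
    }


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (None if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A family present in all or no genomes carries no signal.
        return None
    return cov / sqrt(var_x * var_y)


def cooccurrence(profiles):
    """Correlate every pair of family profiles."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Toy input: which (hypothetical) families occur in which genomes.
toy_genomes = {
    "g1": {"F1", "F2"},
    "g2": {"F1", "F2"},
    "g3": {"F3"},
    "g4": set(),
}
profiles = presence_profiles(toy_genomes, ["F1", "F2", "F3"])
correlations = cooccurrence(profiles)
```

Because every genome counts equally here, 20 closely related genomes would count as 20 independent observations, which is exactly the weakness the independent-contrasts approach is meant to address.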