Moore Notes 4 7 10
From OpenWetWare
Group Call
- Steve will check with PLoS about our paper
- Update from Steve
- GOS analysis
- Guillaume and Dongying produced new marker gene families for GOS
- Now have good archaeal markers (old ones were mostly hitting bacteria)
- Steve will run PD analyses on GOS with these markers
- Can now search, align, mask other data sets
- Does the phylogenetic distance between two genomes in the tree predict how similar their 16S copy numbers are?
- KP: take a look at Matt Hahn's CAFE program
- Can adjust relative abundance estimates to account for copy numbers
- JE: would be interesting to do a phylum-level analysis comparing normalized 16S vs. non-normalized 16S vs. single-copy protein marker genes
- Did this on HOT/ALOHA and human distal gut, and saw a big difference in HOT/ALOHA (Prochlorococcus goes from rare to abundant)
- Will do GOS, shotgun (better) and PCR
- JE: relative abundance could also be adjusted for genome size (since genome size may affect probability of getting a read in 16S)
- AD: could estimate genome size by proportion of reads hitting genome assemblies (compare single vs. multiple copy genes)
- SK: recent ISME paper about correcting for genome size (on citeulike): http://www.citeulike.org/group/6072/article/6912281
- JL: Is Jenna still thinking about estimating abundance? JE: paper is in press, could use her data to test the corrections
- KP: Good idea to test the method on this or some other gold standard data set or simulation
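A minimal sketch of the copy-number adjustment discussed above, assuming the simplest form of the correction: divide each taxon's 16S read count by its 16S copy number, then renormalize. The function name, taxon labels, counts, and copy numbers are all hypothetical and for illustration only; this is not the group's actual method.

```python
def corrected_relative_abundance(read_counts, copy_numbers):
    """Divide each taxon's 16S read count by its copy number, then renormalize to sum to 1."""
    adjusted = {taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts}
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical read counts and 16S copy numbers (not real data).
reads = {"taxon_A": 600, "taxon_B": 400}
copies = {"taxon_A": 6, "taxon_B": 1}

corrected = corrected_relative_abundance(reads, copies)
# taxon_A drops from 0.6 of raw reads to 0.2; taxon_B rises from 0.4 to 0.8.
```

With these made-up numbers the single-copy taxon goes from the minority to the majority, the same kind of shift noted for Prochlorococcus in HOT/ALOHA.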
- Update from Guillaume
- Generating new protein family HMMs
- HMM search of Dongying's 100 families (universal or big, but not markers per se) vs. IMG
- BLAST all vs. all on all sequences that didn't hit a family
- MCL on these BLAST hits to cluster them into families
- Just finished 20 day process of generating these results
- 350,000 families (6,500 with >75 members)
- MCL may need to be fine-tuned
- Next: will build HMMs for these new protein families that we can use in various analyses
- Will search back against data sets to see if the new HMMs look robust
- Morgan will look at clusters that don't hit any previously described family (or in PFAM but no known function)
- Tom will subdivide existing families into subfamilies, looking for novel subfamilies in metagenomic data
- This subfamily approach could be applied to Morgan's families also
- Goal: data freeze in the next month or so
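As a rough illustration of how family counts like those above might be tallied, here is a sketch that summarizes cluster sizes. It assumes the clustering output has one cluster per line with tab-separated sequence labels (the layout MCL writes in label mode); the function names and example labels are hypothetical. Rerunning the same summary on clusterings produced with different MCL settings (for example, the inflation parameter) would be one way to approach the fine-tuning.

```python
def family_sizes(cluster_lines):
    """Return the member count of each cluster (one cluster per line, tab-separated labels)."""
    return [len(line.rstrip("\n").split("\t")) for line in cluster_lines if line.strip()]


def summarize(sizes, min_members=75):
    """Count all families and those with more than min_members members."""
    return {
        "families": len(sizes),
        "large": sum(1 for size in sizes if size > min_members),
    }


# Hypothetical three-cluster output (not real data); threshold lowered to 2 for the toy example.
example = ["seq1\tseq2\tseq3\n", "seq4\tseq5\n", "seq6\n"]
summary = summarize(family_sizes(example), min_members=2)
```

For a real file, the lines could be read with `open(path)` and passed directly to `family_sizes`.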
- Dongying: What deep coverage data sets are out there besides GOS?
- JE: Beijing's human gut data is available in the Short Read Archive at NCBI (and may soon be on BioTorrents); much of it is also on their website
- Steve: the Beijing gut paper assembled Illumina reads into contigs - what do we think about that?
- TS: used Sanger scaffolds to do some testing of the process
- JE: checked some against known assemblies
- AD: the Beijing assembler has not been documented
- Variable coverage breaks most assemblers
- JE: generally not a good idea to use assembled reads for metagenomic analyses (because they generate hybrids)
- OK to leverage contigs/alignments in population level analyses
- Tile reads vs. a reference assembly if you want to compare them to each other
- Best not to use the consensus sequence for downstream analyses
- JE: first Illumina based study that looks good - Illumina might be more promising for metagenomics than originally thought
- Uses much less DNA (~1 microgram) than 454 (20-40 micrograms, though this may be coming down)
- Need to check diversity measures vs. other data sets
- This genome-guided approach gets better as more human microbiome genomes are sequenced
- Data here http://gutmeta.genomics.org.cn/
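To make the point about variable coverage and tiling reads against a reference concrete, here is a small sketch that computes per-position read depth from alignment intervals. It assumes alignments are given as 0-based, end-exclusive (start, end) pairs that fall within the reference; the function name and the example intervals are hypothetical.

```python
def depth_profile(ref_length, alignments):
    """Per-position read depth from (start, end) alignments (0-based, end-exclusive)."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for position in range(ref_length):
        running += diff[position]
        depth.append(running)
    return depth


# Hypothetical reads tiled against a 10 bp reference (not real data).
tiled = [(0, 4), (2, 6), (3, 8)]
profile = depth_profile(10, tiled)
# profile == [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
```

Keeping the individual reads tiled against the reference, rather than collapsing them to a consensus, preserves this per-position picture for comparing reads to each other.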