Drosophila 16S Paper
From Jenna: In any section or thread, my most recent comments that are left unanswered/unresolved/in limbo in any way will be in green. As they are addressed, I will turn them into black text. If there is something that I think needs immediate action/answering/reading or otherwise seems important, I will make it red. This is a trial strategy, so it may change.
Technical Questions for Thought/Discussion
- how will Bellerophon handle incomplete sequences? Does the type of incompleteness matter, i.e. hole in the middle vs. only one half of the clone?
- Looks to me like most of the putative chimeras were classified to the genus level. Here is my question to all: do we throw these out now, or do we investigate them further in an attempt to keep as many as possible in our analysis?--Jenna L Morgan 14:47, 20 April 2009 (EDT)
- So there are 291 total Chimeras, and they do not seem to be evenly distributed throughout the libraries. DmWMah is 17% chimeric, DmLCAN is 28% chimeric, and HCF and HPP are both 25% chimeric, whereas all the others are around 4%. If they were evenly distributed I would say throw them out, but for these specific libraries they represent a significant portion of the data. --James Angus Chandler 16:59, 22 April 2009 (EDT)
30% of the chimeras were not classified at the Genus level and 4% were not classified at the family or Order level --James Angus Chandler 17:17, 22 April 2009 (EDT).
* I looked at these alignments, and while a few of the taxa were missing a lot of the gene, most were (mostly) full-length. Talked to Jonathan. He and I agree that since we have these objective means by which to ensure that our final analyses are based on high-quality data, that we should rely upon them, rather than second-guess them and waste time sifting through these putative chimeras. Perhaps these sequences are not actually chimeric, but they are likely funky in some measurable way. So, unless we have a specific reason to analyze these on an individual basis, let's move forward with them removed.--Jenna L Morgan 18:02, 23 April 2009 (EDT) Media:Lab_unclassifieds.pdf
- So here is a graph of the number of unclassified clones in all of our laboratory samples. The overall question will be what do we do with these clones, and just looking at our lab samples is a good place to start. It is hard for me to imagine that this level of "unclassified" organisms results because they are actually novel or not found in greengenes. This is more likely a technical problem. As an example, DO and DM are both melanogaster on lab media, however the discrepancy between them is hardly trivial.
**Overall, what we are to do with libraries like this, or with individual unclassified clones, is a question that I think needs to be resolved before we can get any credible information from the "classified" clones.
*** I am having related issues with my "metagenomic simulation" project, for which I created 16S PCR libraries from a mixture of 10 organisms that have all had their genomes sequenced. No one wants to hear this, but some of the sequences are mis-classified or unclassified. Please understand that the following are just thoughts I'm mulling over:
- Some of the sequences are likely from organisms that are contaminants.
- Some of these are being under- or mis-classified because some part(s) of the sequence is poor quality.
- Something is wrong with greengenes. I've been using greengenes for this project recently because it has a rapid turnaround relative to STAP, and I've needed that quality for troubleshooting purposes.
- Angus, can you do this?:
- make a list of organisms that are "unclassified"
- use the fetch script to make a fasta file of those sequences - this fasta file has been uploaded to the Data Files section --James Angus Chandler 19:30, 24 April 2009 (EDT)
- blast those sequences against the ncbi database (if you can run this from the command-line, use the -m 8 option, if not, send me an email asap so I can write the script to parse standard blast output)
- post the blast results here
In the meantime, I will write a script that will give us the top blast hit along with the percent identity for each hit. This will give us a flag for those "unclassified" sequences that maybe SHOULD actually be classified. This seems like a good way to start dealing with this. --Jenna L Morgan 14:46, 20 April 2009 (EDT)
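The top-hit script described above could look something like the following. This is only a sketch, not the script Jenna wrote: it assumes the standard 12-column tab-delimited layout produced by `-m 8` (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The clone and subject IDs in the example and the 97% identity cutoff are placeholders chosen for illustration.

```python
import csv
import io


def top_blast_hits(handle):
    """Return {query_id: (subject_id, percent_identity, bit_score)}.

    Keeps the highest bit-score hit per query from BLAST tabular (-m 8) output.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (the latter appear in -m 9 output).
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        pct_identity = float(row[2])
        bit_score = float(row[11])
        if query not in best or bit_score > best[query][2]:
            best[query] = (subject, pct_identity, bit_score)
    return best


def flag_should_be_classified(best, min_identity=97.0):
    """List queries whose top hit meets the identity cutoff (cutoff is an assumption)."""
    return sorted(q for q, (_, pid, _) in best.items() if pid >= min_identity)


# Made-up example rows; real input would be the BLAST output file.
example = (
    "clone_001\tsubjA\t99.10\t1400\t12\t1\t1\t1400\t5\t1404\t0.0\t2500\n"
    "clone_001\tsubjB\t95.00\t1400\t70\t0\t1\t1400\t1\t1400\t0.0\t2100\n"
    "clone_002\tsubjC\t88.50\t900\t100\t3\t1\t900\t10\t909\t1e-150\t800\n"
)
best = top_blast_hits(io.StringIO(example))
flagged = flag_should_be_classified(best)
```

Here `flagged` holds the "unclassified" clones that have a near-identical database match and so may deserve a second look. Picking the best hit by bit score rather than by file order is a design choice; it avoids relying on how the output happens to be sorted.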
Jenna, any update on the status of the unclassifieds? Or is there anything I should be doing at the moment? --James Angus Chandler 13:25, 7 May 2009 (EDT)
I found that some of the sequences had been screened for vector but not quality. So, I'm doing the quality trim, realigning, and then running everything through the automated workflow.--Jenna L Morgan 01:41, 16 May 2009 (EDT)
- I agree, looking in detail at the lab samples is a good way to figure out where the "unclassified" bugs are coming from and why. I also agree it has to be done before moving on.
- I also have some questions:
- Is the proportion of chimeric/unclassified clones equally high in "natural" samples, or do you only see this in the lab samples?
Looking at it very roughly: 35% of the lab clones are unclassified and 22% of the wild clones are unclassified. If you want a graph showing it library by library I can do that easily. --James Angus Chandler 13:25, 7 May 2009 (EDT)
- Are all or most unclassified clones chimeric?
- When you run the chimera-detection analysis, does it show you the putative recombination breakpoints (sort of like the 4-haplotype test in population genetics)? If so, and if some libraries are especially valuable, we could BLAST each half separately.
I think that the putative breakpoints are available in the excel sheet that is returned by the chimera checker. We should wait and see what things look like once the low-quality stuff is removed, though.--Jenna L Morgan 01:41, 16 May 2009 (EDT)
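If we do end up BLASTing each half separately, splitting a clone at its putative breakpoint is simple. The sketch below makes some assumptions: the breakpoint is taken to be the 1-based position of the last base of the left-hand parent, which may not match the convention in the chimera checker's spreadsheet, and the clone name and sequence are made up.

```python
def split_at_breakpoint(seq_id, sequence, breakpoint):
    """Split a sequence into left and right halves at a putative breakpoint.

    `breakpoint` is assumed to be the 1-based position of the last base
    belonging to the left-hand parent.
    """
    if not 0 < breakpoint < len(sequence):
        raise ValueError("breakpoint must fall strictly inside the sequence")
    left = (f"{seq_id}_left_1-{breakpoint}", sequence[:breakpoint])
    right = (
        f"{seq_id}_right_{breakpoint + 1}-{len(sequence)}",
        sequence[breakpoint:],
    )
    return [left, right]


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Made-up clone; a real run would read the clone FASTA and the breakpoint table.
halves = split_at_breakpoint("cloneX", "ACGTACGTAC", 4)
fasta_text = to_fasta(halves)
```

The resulting FASTA text can be written to a file and BLASTed like any other query. Very short halves will give unreliable hits, so a minimum length filter would probably be worth adding.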
Scientific Questions for Thought/Discussion
Media:79_9Categories.pdf Here is a graph of all of our libraries (minus 3A, 3B, Ns and Wolbachia) at the genus level. The interesting thing to note is that taking all of our samples as a whole (lab/wild/interior/exterior/everything...), we get that a full 79% of the clones can be put into just 9 different categories. Although there is hardly a "core microbiome" since no genus is present in all the samples, it is cool to see that the same few things make up such a large percentage of our data.
- Hey guys, what is Haemophilus doing there? Are we sure? Has it been found in other insects? Artyom
A search of my hard drive (which has about every culture-independent fly and insect microbe paper ever written on it) for Haemophilus came up empty. An interesting thing to add is that one library completely dominated by Haemophilus is mojavensis/arizonae, and I cannot find any previous work that mentions cactophilic flies being associated with this particular microbe --James Angus Chandler 22:53, 30 April 2009 (EDT)
This also brings up several technical questions. When we want to look at our data as a whole and make broad conclusions, which libraries should be included? There are the obvious distinctions of lab/wild and interior/exterior/guts, but what about the different sequencing methods (Sanger/PhyloChip) and the different DNA extraction methods? To answer the bigger question of just what a typical fly community is composed of, I want to use as much data as possible. But many of our libraries (not to mention the Corby-Harris and Cox-Gilmore data) are slightly different from each other.
- I know this is a pain technically, but I think this will have to vary from question to question. For a question like "Is only a non-random subset of known Lactobacilli found in flies?" you'd probably want to include all the samples. For a question like "Are fruit-feeders as a group different from flower-feeders as a group?" you will need to exclude all lab samples. For a question "Are lab flies different from wild flies" or the effect of diet you will only include melanogaster samples, etc. With this in mind, it is probably better to invest some effort upfront to make your data format so flexible that including/excluding any subset of samples for a particular analysis is very easy.
- And I have no idea how to combine Sanger and phylochip data... that might just have to be a separate comparison.
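One way to get the flexibility suggested above is to keep a small metadata record per library and select subsets by attribute. This is just a sketch: the field names and the `LIB_A`...`LIB_D` entries are hypothetical placeholders, not our actual sample table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    name: str
    species: str
    origin: str   # "lab" or "wild"
    tissue: str   # e.g. "gut", "exterior"
    method: str   # "sanger" or "phylochip"


def select(libraries, **criteria):
    """Return the libraries matching every criterion.

    Each criterion value may be a single value or a collection of allowed values.
    """
    def matches(lib):
        for field, wanted in criteria.items():
            if isinstance(wanted, (set, frozenset, list, tuple)):
                allowed = wanted
            else:
                allowed = {wanted}
            if getattr(lib, field) not in allowed:
                return False
        return True

    return [lib for lib in libraries if matches(lib)]


# Hypothetical entries standing in for rows of the sample spreadsheet.
libraries = [
    Library("LIB_A", "melanogaster", "lab", "gut", "sanger"),
    Library("LIB_B", "melanogaster", "wild", "gut", "sanger"),
    Library("LIB_C", "hydei", "wild", "gut", "sanger"),
    Library("LIB_D", "melanogaster", "wild", "gut", "phylochip"),
]

# "Are lab flies different from wild flies?": melanogaster only, Sanger only.
lab_vs_wild = select(libraries, species="melanogaster", method="sanger")
# Any comparison among feeding groups: exclude all lab samples.
wild_only = select(libraries, origin="wild")
```

Keeping the metadata separate from the clone counts means each analysis only has to state its inclusion criteria, and the Sanger and PhyloChip libraries can be kept apart simply by filtering on `method`.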
Media:74_7Genera.pdf Here is the same as above, except that I have removed ALL the unclassified clones from the analysis. Note that some of the libraries (Dml_CANS, DmW_Turr1/2, IMH) are now mainly composed of genera not found in the other libraries. We need to resolve whether these differences are due to actual biology or just technical difficulties. But it is still good to see that only a few genera make up a substantial portion of the dataset.
Lab Flies only: Internal vs External, unclassifieds still included-Media:intVSext+unC.pdf
Lab Flies Only: Internal vs External, gammaproteo and enterobacter unclassifieds removed-Media:intVSext-unC.pdf Only D. melanogaster is represented in the above two files. The "grand total" column is the average of all the internal samples. The "external" column is the average of 4 external and media samples taken from both the Kimbrell and the Kopp labs, and hopefully is a good representation of what the flies are exposed to in the lab. Things to note: It seems as though the only genus that is enriched inside the flies is Providencia. In fact, Acetobacter and Lactobacillus, the two genera that we have been focusing on most heavily, seem to make up a relatively small proportion of the internal community. Next, it is interesting how the lines from the Kimbrell lab (all those that begin with DmL) are mostly Shigella, Microbacterium and Variovorax, while the Kopp lab samples (all the others) do not have those three genera at all.
- Try re-plotting both data sets after excluding Providencia. You might notice more subtle differences between internal and external that are now obscured by Providencia.
- Also, for visual help, can you make "total Kimbrell lab" and "total Kopp lab" bars on these graphs?
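Both suggestions above come down to two small operations on per-library genus counts: dropping a genus and renormalizing the rest, and pooling several libraries into one "total" bar. Here is a minimal sketch; the counts are invented for illustration and are not our data.

```python
def drop_and_renormalize(counts, excluded):
    """Remove the excluded genera and return the remaining proportions."""
    kept = {genus: n for genus, n in counts.items() if genus not in excluded}
    total = sum(kept.values())
    if total == 0:
        return {genus: 0.0 for genus in kept}
    return {genus: n / total for genus, n in kept.items()}


def pool_counts(libraries):
    """Sum genus counts across libraries (e.g. all libraries from one lab)."""
    pooled = {}
    for counts in libraries:
        for genus, n in counts.items():
            pooled[genus] = pooled.get(genus, 0) + n
    return pooled


# Invented counts for one internal library.
internal = {"Providencia": 60, "Acetobacter": 10, "Lactobacillus": 30}
without_providencia = drop_and_renormalize(internal, {"Providencia"})

# Invented counts for two libraries from the same lab.
lab_total = pool_counts([{"Shigella": 1, "Variovorax": 2}, {"Variovorax": 3}])
```

Note that pooling raw counts weights each library by its number of clones, which is not the same as averaging the per-library proportions as in the current "grand total" column. Either is defensible, but the graphs should say which one was used.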
Table of all Samples Sequenced
Pretty self explanatory, but here is a table of all the libraries for which we have sequence data. The first column is the identifier that I will be using in all of my graphs and communications.
- Apparently, there was some miscommunication about some samples. First of all, no "PhyloChip data" has left my laptop. I put 8 PCR products on the PhyloChip. I also cloned and sequenced the same 8 PCR products (~96 clones each). Also, in May, 2008, at Turelli's, I only collected on Pomegranates, not Citrus, and in September, 2007, I only collected on Citrus, not Pomegranates. We collected both at Michael's and at the Davis experimental orchard on that day. I think that all of the hydei came from the experimental orchard, but I can't be sure. I do know that we ONLY collected on citrus that day.
My mistake about the phylochip data, I must have misunderstood you earlier. Regarding the May 2008 collections, I am pretty sure that I made those collections and that they were on Citrus. I have a folder at home that outlines all of our collections and I will confirm this.--James Angus Chandler 16:39, 22 April 2009 (EDT)
- sorry, my name was in the "collector" column for those May, 2008 samples, so I thought those were mine and I know that the last time I collected at Turelli's, it was on pomegranates.--Jenna L Morgan 13:12, 23 April 2009 (EDT)
- I updated Angus' excel sheet and uploaded a new version. Now, if you want to change it, you can click on the link above. This will allow us to upload updated versions as well as keep track of old versions.--Jenna L Morgan 22:31, 19 April 2009 (EDT)
- Did we ever figure out why the libraries SCA and HPM have twice as many clones as all the others done at the same time? I have looked back at my notes, and SCA was one of the four original survey samples that was done first, so it is conceivable that it then went through again at a later date.
- Additional sequencing was done for these libraries when there was some extra room on some plates.--Jenna L Morgan 21:47, 19 April 2009 (EDT)
Example of Replication
Here is the best example we have of either biological or technical replication with our samples. The first four samples (DmW) are melanogaster collected on Citrus Fruit at Michael's orchard in August 2007. They are all 10 pooled dissected guts. The first one was done with bead-beating, while for the second one the bead-beating was omitted. The amplified DNA was then hybridized to a phylochip. The next two samples were done with bead-beating, cloning and Sanger sequencing. HCF and ICF are from D. hydei and D. immigrans collected off of Citrus Fruit at Michael's Orchard (and another one close by) in May 2008. They are both guts and at least 5 individuals.
- Actually, these are all mixed up (see corrections made to spreadsheet above), but we should either re-do these with the putative chimeric sequences removed, or convince ourselves somehow that it is OK to leave them in.--Jenna L Morgan 22:56, 19 April 2009 (EDT)
Things to Note: The two different DNA extraction methods yield roughly the same outcomes. I didn't upload the graph, but once Wolbachia is removed, the relative amount of each genus is pretty much the same. Next, nothing convincing can be seen by comparing the two Sanger-sequenced replicates to each other or by comparing them to the Phylochip data, since they each are about 80% Wolbachia. The one good thing to take out of this may be that if we are concerned that a population is infected with Wolbachia, the bead-beating step could be skipped and this may not drastically change the results.
- I hear what you're saying about skipping the bead-beating. This would make sense if we presume that the only effect of the bead-beating is to disrupt the drosophila cells. However, I would be surprised to find that this was the only effect. Instead of saying that they are roughly the same, let's test them. I think a simple Chi-square test would be sufficient.--Jenna L Morgan 22:56, 19 April 2009 (EDT)
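The chi-square test suggested above can be sketched as follows. It assumes we have per-genus counts (for example clone counts) for the two extraction methods, listed in the same genus order; it would not apply directly to hybridization intensities. The numbers in the example are invented.

```python
def chi_square_homogeneity(row_a, row_b):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    row_a and row_b are counts per genus for the two samples, in the same order.
    """
    if len(row_a) != len(row_b):
        raise ValueError("both samples must list the same genera")
    # Drop genera absent from both samples; they carry no information.
    cols = [(a, b) for a, b in zip(row_a, row_b) if a + b > 0]
    total_a = sum(a for a, _ in cols)
    total_b = sum(b for _, b in cols)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one count")
    grand = total_a + total_b
    stat = 0.0
    for a, b in cols:
        col_total = a + b
        expected_a = total_a * col_total / grand
        expected_b = total_b * col_total / grand
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat, len(cols) - 1


# Invented counts for two genera, with and without bead-beating.
stat, df = chi_square_homogeneity([10, 20], [20, 10])
```

The statistic is then compared to a chi-square distribution with `df` degrees of freedom (for `df = 1`, the 5% critical value is 3.841). Genera with very small expected counts should probably be lumped into an "other" category first, since the approximation is poor when expected counts are low.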
I included the ICF and HCF samples on this graph because they are the only examples we have of returning to a location in a different year and sampling on the same substrate. Unfortunately, both samples are dominated by unclassified clones. This may mean something biological (novel bacteria!) or could just be a technical artifact. Either way, it seems clear that the most prevalent genus in melanogaster from the year before (Acetobacter), is not present in these two samples.
- I think this conclusion is a consequence of the mix-up.--Jenna L Morgan 22:56, 19 April 2009 (EDT)
Finally, check out the Mesorhizobium! I can't say that I was expecting to see 56 clones of that.
- I wasn't surprised by this because these flies were eating citrus that was lying on (sometimes smushed into) the soil. Assuming that these are root-associated bacteria, they are likely to be abundant in the soil.--Jenna L Morgan 22:56, 19 April 2009 (EDT)
Materials and Methods
I can edit too
Artyom Just testing - looks like I can edit?
Jenna <- click here to see what I'm up to!