Simulation parameters being decided
This is a basic outline of the steps in the simulation pipeline. The steps and parameters that are currently being discussed and chosen are in bold. Please add your requests/suggestions/feedback at the bottom!
- Pick a gene family from AMPHORA.
- Pick a subset of AMPHORA Reference Sequences for this gene family to be the database for this run of simulations.
- How do we pick the database? Uniformly over all AMPHORA Reference Sequences? Or do we choose them to maximize phylogenetic diversity? Something in between these extremes?
- How many sequences do we choose for the database? We have been considering 100. We could use all of AMPHORA Reference Sequences, but that may unnecessarily slow down our future analyses.
- Retrieve all the database sequences (as DNA sequences) from NCBI.
- Produce the DNA reads: i.e., for each simulation, choose N sequences to be fragmented into reads (by MetaSim).
- How many sequences do we choose for fragmentation, i.e., what should N be? We can choose a few different values for the first few runs -- what should the range be? (15? 100? 1000?)
- From what set do we choose these sequences? Do they come only from the database? Or do we draw them from the subset of AMPHORA Reference Sequences that are not in the database? Or do we draw them from the full set of AMPHORA Reference Sequences?
- If any of the sequences are drawn from the database, do we remove these sequences from the database for the latter parts of the simulation? If we do this, the database will be slightly different for each simulation.
- In any case, how do we choose the sequences for fragmentation? Do we choose them uniformly at random, or in some other way?
- Should we filter the file of reads at this stage so that there is at most one read per original sequence, or at most some other limit? Or should we filter later, after the matching to the profile HMM (see below)?
- Blastx the reads to find the best frame(s) for translation of the reads, and translate them into peptide reads.
- Which sequences are in the blast database? The database of sequences chosen at the beginning, or the full set of AMPHORA Reference Sequences?
- Find the reads that match the HMM profile for this gene.
- Use a version of MarkerScanner.pl from AMPHORA.
- What profile HMM do we use? Do we use the profile built from all the AMPHORA Reference Sequences, or do we build a new profile HMM from only the sequences in the database?
- Do we use the regular or conservative mask from AMPHORA to remove problematic columns from the sequences before the HMM is built? If we use the mask at all, are we required to build the profile from the full set of AMPHORA Reference sequences for it to make sense? (Martin, this last question is mainly for you.)
- Some reads may get dropped in this step, as they may not match the profile well enough. This may produce a variable number of reads over the different simulations -- is this okay?
- Do we filter at this stage (possibly instead of after producing the DNA reads) in order to try to get at most one read per sequence (or some other limit)?
- Align the reads and the database sequences to the HMM profile and output the alignment.
- Use a version of MarkerAlignTrim.pl from AMPHORA.
Sam says: Thanks to Steve and Jonathan for the feedback. After discussing this with Katie, we have narrowed down the possible alternatives. However, we are still trying to make a few additional decisions. See the current proposition below, with the undecided matters in bold. If you have strong feelings about any of these issues, please let me know by putting a comment here or e-mailing me asap.
- For one run of a bunch of simulations, pick a fixed database as follows: use the MaxPD criterion to select 75 sequences from AMPHORA. Select 25 additional sequences uniformly at random from the rest of AMPHORA. This allows a combination of coverage with hopefully a bit of bushiness.
- For each simulation, choose sequences to be fragmented as follows:
- Choose some number N1 (approx 15-20) sequences uniformly at random from all of AMPHORA. (Some of these will appear in the database; some won't.) I will put a uniform distribution on these for fragmentation into 100 reads by MetaSim. Either use only these reads (for a uniform community distribution) or go on to the next step (for a non-uniform community distribution).
- Choose some other number N2 (approx 2-5), where N2 is probably much smaller than N1, sequences uniformly at random from all of AMPHORA. These will get a uniform distribution for fragmentation by MetaSim into maybe 25 reads. These reads will get added to the set above. This second step may make the community look more realistic. However, we can delay implementing this step to later simulations.
- Use a version of MarkerScanner.pl script from AMPHORA (without the E. Coli screen) to see which reads match the Marker profile hmms. Only the reads picked by AMPHORA will remain.
- Prune down the size of the set of remaining reads to some manageable number like 20. Do this by choosing a subset uniformly at random from the set of reads or by filtering reads so that there is at least one read per original sequence whenever possible. The first way may end up with many fewer than N1(+N2) of the original sequences represented, but it's perhaps more realistic. If we do this, we may want to choose more than 20 sequences in the final set of reads (or we may want to increase N1). The second way may be simpler to analyze, for now, at least.
Comments, suggestions, and feedback of any kind
- [Add your notes here...]
Steve says off the top of his head
- I think the number of reference/fragmented sequences should be chosen for convenience (i.e. ability to analyze quickly) at this point unless someone feels otherwise, we can always increase sample size to more 'realistic' levels later. So the suggestion of 100 reference sequences plus 20 random reads sounds good.
- If we want the reference sequences to be representative of the phylogenetic diversity in the AMPHORA data set we should choose reference sequences using maxPD criterion. Also this way you only have to choose reference sequences and generate a reference sequence phylogeny once for a given number of reference sequences.
- Choosing sequences to be fragmented as uniform random seems reasonable. I think it would be good to choose the sequences to be fragmented from the entire AMPHORA data set since otherwise we will have every fragmented sequence present in the reference data set which seems like it might be 'cheating' a bit. Across multiple simulations we'll end up with some proportion of the fragmented sequences being present in the reference database and some not which seems more realistic to me than the case where all the fragmented sequences are present in the reference sequences.
- Suggest we use the whole AMPHORA database for the blast database and HMM search, to simulate the actual process of identifying reads that we are using currently.
- I think it's fine if we have multiple fragmented sequences from the same reference sequence and if the exact number of fragmented reads varies a bit from simulation to simulation - we can deal with this during the analysis.
- For "Blastx the reads to find the best frame(s) for translation of the reads, and translate them into peptide reads. " we should be able to use HMMER3 I think which allows translational searches. But blastx may be easier.
- Regarding "do we remove these sequences from the database " it seems like we will have to remove them somehow - but this could come at analysis not for the database --- it would be trivial if we showed we could assign a read to the correct group if we had a 100% match to itself ...
- Regarding the above, it might be easier to set up two databases - one from which we will pull sequences and one which we will use to compare to. We could set them up by taking a database of genomes and splitting in half.
E-mail conversation with Dongying
On Thu, Jul 16, 2009 at 4:02 PM, Dongying Wu <firstname.lastname@example.org> wrote:
> On Thu, Jul 16, 2009 at 3:51 PM, Samantha Riesenfeld<email@example.com> wrote:
> That's a good idea -- thank you. Does this list agree with the MaxPD list for the AMPHORA tree? Maybe for the time being, since I use AMPHORA Reference sequences so heavily, I will just use bacteria and leave out archaea for now?
That is OK.
> Also, why should I avoid Candidatus or Mycoplasma? Right now, I'm certainly using Mycoplasma sometimes, as it is in the set of Reference Sequences....
Mycoplasma are human parasites and pathogens and have severe genome reductions, Many Candidatus on top of my list are animal (insect) symbionts with genomes heavily reduced. They usually make top of MaxPD list but have no presence in nature environments.
> On Thu, Jul 16, 2009 at 3:36 PM, Dongying Wu <firstname.lastname@example.org> wrote:
>> AMPHORA doesn't have archaea markers at all. The archaea list was based on radA tree.
>> The approach I can think of is to find a real metadata, and use AMPHORA markers to blastp the dataset and get all the potential markers from the metadata. search the meta marker genes back against AMPHORA, and keep track of the organisms that
>> 1. appear in the top 50 blastp hits (or below a certain e value cutoff)
>> 2. appear in the top 100 (or some other number) organisms in my list.
>> You should avoid anything with Candidatus in the name or mycoplasma.
>> On Thu, Jul 16, 2009 at 3:23 PM, Samantha Riesenfeld<email@example.com> wrote:
>> > Thanks, Dongying! One question: do you have any idea what would be the most sensible way to choose 75 total taxa from both lists? I'd like to use some bacteria and some archaea, but I don't know how to combine the MaxPD criterion for the separate lists together. Is there a reason they are analyzed separately?
>> > Is there a good proportion that would make sense to use?
>> > -Sam
>> > On Thu, Jul 16, 2009 at 3:13 PM, Dongying Wu <firstname.lastname@example.org> wrote:
>> >> Here is the list: ...
>> >> On Thu, Jul 16, 2009 at 3:09 PM, Samantha Riesenfeld<email@example.com> wrote:
>> >> > Hi y'all, Thanks to Jonathan, Steve, and Katie for feedback. I've made a few decisions about how to do this run of the simulations. There are a small number of things that I'm still trying to resolve one way or the other. If you get a minute, please check out: http://iseem.private.openwetware.org/wiki/Simulation_parameters#Decisions in case you have ideas or strong feelings about any of the decided or undecided matters. I'll probably produce the data tomorrow, and we can revisit whatever you want changed in future runs.
>> >> > Also, Dongying, could you send me your ordered list of taxa from AMPHORA that are most phylogenetically diverse? It might also be helpful to have the script that generated the taxa. I thought you gave it to me once, but I can't find it now...
>> >> > Thanks! -Sam