Discussion on simulating metagenomic data

From OpenWetWare
Jump to navigationJump to search

Major overhaul (15 April 2009): I reorganized the wiki pages on simulations; the main page is Simulations of Metagenomic Data, and this page is reserved for discussion.
-- Sam Riesenfeld

Which parameters?

We are trying to decide on the short list of parameters which we would like to use in our simulations. Parameters that have been suggested include:

  • Abundance distribution of species
    • What types of distributions should we try initially?
    • Anything but the simplest distribution probably requires a little scripting to work with MetaSim.
  • Which protein families
    • Steve has suggested even just starting with one protein family to see what we can get the phylogenetic methods to do.
  • Read lengths
  • Which complete genomes to sample from and which to have as "complete"
    • Which genomes should we start with for our initial simulations?
  • Number of complete sequences and number of fragments
  • ...[Feel free to add to this list.]

It would be helpful to know if people (especially Dongying and Martin) have any thoughts on this list, including opinions about which parameters we should begin with or how certain parameters should vary in our initial simulations. Sam and/or Steve are planning to get started on simulations the first week of Feb.

Questions and unresolved issues

  • Do we want to just start testing some methods on Martin's simulated data set for AMPHORA described below?
    • If so, which methods? (The answer to this question should probably go on the Phylogenetic Methods page.)
  • Are there reasons for us to generate a different simulated data set or wait for Martin's additional simulated data?
    • It seems the general answer is "yes". We will want to include additional protein families, more species, different abundance distributions, etc.
  • Are we interested in using the MetaSim software package? A description of MetaSim is below.
    • One note: It appears to be a capable simulator, but MetaSim is closed source and so we run the risk of a situation much like what Srijak just encountered with wu-blast: the copyright holders decide to prohibit redistribution and remove the software from the web.
    • I assume Jonathan put in the above note about using closed source code. It is a very reasonable reservation. There are also advantages to using MetaSim (e.g., it may make getting things going soon a bit simpler). It would be worthwhile to make a decision about this sooner rather than later. At the moment, I think Steve is experimenting with MetaSim, and I probably will, too. I'm also happy to pursue other avenues, like using and altering Martin's scripts. One problem with that, as Martin pointed out, is that the sequencing errors are not simulated. I don't know how much of a concern this is for us. [Sam]

Morgan Price's simulated data set

Morgan Price tested FastTree to build trees from fragmented alignments. See his note on Jonathan's blog. Info on his simulated alignments follows.

E-mail from Morgan to Jonathan

Hi Jonathan,

I've put the alignments up for download at http://morgannprice.org/overgap50.tar.gz

The true trees were obtained from an actual COG alignment, computed with PSI-BLAST using (mostly full-length) sequences from complete genomes with PhyML. The alignments were simulated with ROSE.. The simulated alignments with truncated sequences are in *.p, and the inferred trees are in *.fastme.gapcmp, *.ftpsu.gaptree, etc. (ftpsu means FastTree with pseudocounts.)

Similar data for trees with realistic gaps, without the truncations, are in http://morgannprice.org/newsim50.tar.gz

For more explanation of how the sequences were truncated, see http://www.microbesonline.org/fasttree/fragments.html

If you have any questions please let me know.

-- Morgan Price
Arkin lab

AMPHORA data set

Martin created a simulated data set for AMPHORA. He described how he created his data sets in the e-mail below.

Martin's email to Sam, etc.

Hi Sam,
The simulated data sets for AMPHORA can be downloaded from http://bobcat.genomecenter.ucdavis.edu/AMPHORA/download.html. The data set contains randomly truncated protein sequences of 31 markers from 100 representative bacterial genomes. The protein sequence length ranges from 100 aa to 300 aa, which is equivalent to 300bp to 900bp DNA sequence fragments.

Briefly, this is how I generate the simulated data
1. Choose bacterial representatives using the genome tree
2. Remove fast evolving bacteria such as Mycoplasma
3. Retrieve protein sequences of the 31 markers from the complete genome database (ComboDB)
4. Truncate the protein sequences randomly to form 100-300 aa fragments

The command and script I used are as follows. You might need to tweak the codes a little bit for them to work since I run my scripts on my local machine previously. Now they reside in the genbeo.genomecenter.ucdavis.edu and genbeo has a slightly different environment setting. Jenna or Srijak should be able to help on this.

I'm also planing to generate additional simulation data set. I'll have a rotation student coming this spring and I hope he can help us on this.


# Choose representatives
/share/eisen/mwu/puma-scripts/maxPD.pl -t $combo/Marker/combo.phyml.wag.tre -o tmp -n 200
cat maxPD.out |gawk '{print $1}'> tmp
/share/eisen/mwu/puma-scripts/COMBODB/ListChromosomeByTaxonid.pl tmp > contigid

# Get marker Sequences from representatives
/share/eisen/mwu/puma-scripts/GenomeTree/GetMarkerBySpecies.pl contigid /home/mwu/Projects/ComboDB/Markers/020908/

# Truncate the sequences
ls *.pep |gawk '{print "/share/eisen/mwu/puma-scripts/subsequence.pl "$_ " 100 300"}'|bash

MetaSim software

If we want to simulate new data sets, we may be able to make use of the MetaSim software by Huson and Ott.

From the website, features include:

  • integrates a database for source genome sequences
    • Can this be our own db? Can we alter the number of species and the gene families used? [Katie]
      • Yes [Steve]
  • generates sets of synthetic reads or mate-pairs based on adaptable sequencing error models (e.g. for Sanger chemistry, Roche's 454 and Illumina (former Solexa)
    • presumably we can alter read length within these different technology models? [Katie]
      • Yes [Steve]
  • enables the user to configure abundance values for each organism to model specific taxon compositions
  • provides a population sampler to generate evolved sequences based on source genomes and a given evolutionary tree
  • can be controlled via graphical user interface or in command line mode
    • Command line mode in linux/unix? Is the GUI platform independent? [Katie]
      • Yes and yes [Steve]

Potential con is the software is closed source.

See the discussion above on the pros and cons of using MetaSim.

ReadSim software

Martin mentioned ReadSim software by Schmid and Huson as an alternative. However, the website says ReadSim is no longer maintained and has basically been incorporated into MetaSim. It is open source.

  • Is it really open source? Martin originally thought it might be, but then said it was not. I can't tell from the web page. [Sam]

Rose software plus fragmentation

ROSE (Stoye et al., 1998) has been used by Steven Brenner's lab to simulate (via evolutionary models) protein family alignments that can then be fragmented, truncated, and edited with "sequencing errors". ROSE was also used by Cheng et al and earlier by Hartmann and Vision. Morgan Price used ROSE too.

Datasets for testing

Here is a list of datasets we have accumulated for testing phylogenetic inference methods.

Scenarios to test

  • 1) One read per taxon
    • Metasim Simulation Parameters:
      • Preset Name: Exact
      • Number Of Reads / Mate Pairs=5000
      • Error Model=Exact
      • Exact Error Model DNA Clone Parameters=
      • Distribution: Normal
      • Mean: 290.0
      • 2nd parameter: 0.0
      • Combine All Files=true
      • Uniform Sequence Weights=true
      • Number Of Threads=1

  • 2) one read per taxon, some overlap in reads
  • 3) mulitple reads non-overlapping per taxon
  • 4) mix of rare (not overlapping) and abundant taxa (overlapping)

Methods for Comparing Trees

Here is a rough overview of tree comparing approaches. Feel free to edit.