Action items for the simulations and analysis

From OpenWetWare
Jump to navigationJump to search

(Page created: Nov. 2009)

Simulation code - Sam: I got these items

  • Done. Output alignment of source seqs and ref seqs and as well as an alignment of reads and ref seqs. (Source seqs are the sequences from which the reads are generated.)
  • Done. Parametrize the input tree used by maxPD for the choice of the reference seqs. Also get randomly sampled reference sequences only from the sequences on the tree.
  • Done. Parametrize how many ref seqs are chosen via MaxPD and how many ref seqs are chosen uniformly at random from the rest of the sequences. Check for identical sequences?
  • Done. Either change the labels of the reads so that the ID of the source sequence is included, or build a table to be able to convert between read IDs and source seq IDs.
  • Done. Add a filter that can select a sample uniformly at random from the reads matched by the AMPHORA scan.
  • Done. Parametrize how many of the source sequences can be reference seqs. (Right now the source seqs are just picked uniformly at random from all sequences, so we don't have a handle on the overlap between source seqs and ref seqs.)

Prep for Simulations

  • Done. Look at average branch length for the tree for each marker gene in AMPHORA, and check to see what the extremely values are. If rpoB and smpB are representative of the extreme values, we are all set. Otherwise, pick one or two more marker genes that are representative and get the peptide sequences for those Marker genes from Morgan's DB. (May also want to consider the number of columns in the alignments.)
    • Tom is taking care of the above item in addition to analyzing alignment lengths
  • Future: adapt pipeline so it can be run with non-AMPHORA database of peptide seqs.
  • Future: see if we can get recA seqs and HMMs and alignments, so we can feed recA to the simulation pipeline.
  • Future: see if we can easily adapt pipeline to run with nucleotide sequences instead of peptides.


  • Sam: For now, I will take these items, but I may fork some of them out once I've checked that everything is running smoothly, if others are interested.

Run simulations from rpoB and smpB, or whatever Marker genes we've decided on. Here are the range of values for each parameter (all combinations of parameters should occur, except for the settings that cause us to run out of "source sequences" because the total number of sequences isn't big enough -- we may fix this later by generating additional sequences from the HMM, but for now, we just leave out these combos of parameters):

Reference database parameter settings

  • Number of reference databases created for each combination of parameter settings: 1
    • Need more?
  • Number of reference sequences: 20, 60, 100
  • Fraction of the reference sequences chosen via maxPD: 0.25, 0.50, 0.75

Simulation parameter settings

  • Number of simulation data sets created for each combination of parameter settings: 3
    • Need more? 10?
  • Ratio of the desired number of reads to the number of reference sequences: 0.25, 2.5, 25 (if there are enough source sequences)
  • Read lengths: 100, 400 nucleotides
  • Ratio of the number of source sequences that are reference sequences to the total number of source seqs: 0, R, where R = (# ref seqs)/(# all seqs)
  • Ratio of the number of source sequences (which are used to generate reads) to the desired number of reads: 1
    • Note that, depending on how many reads are generated and how they are filtered, all source sequences are unlikely to be represented by the final sample of reads.
  • Binary parameter for the filtering of reads:
    • The reads that match the AMPHORA profile for the gene are then filtered in one of two ways: (1) So there is at most one read per source sequence, or (2) the desired number of reads are sampled uniformly at random from the reads matching the HMM.
  • Ratio of number of reads generated by MetaSim to the desired number of reads: 4
    • Note that the number of reads generated must be larger than the desired number of reads, since many reads do not get picked up by the AMPHORA scan. This value may need to be adjusted even higher depending on how many reads the AMPHORA scan is picking up.

Build trees

  • If you do any of the following items, record publicly (here on this wiki page is fine) the exact command (e.g., for RAxML, FastTree, iterative RAxML, etc.) you run and which version of the software you run, so that we can make sure we stay consistent across simulations and people. (Previously, Steve and Tom have run different types of RAxML reconstructions.)

For each simulation, there is a pair of alignments: (1) the ref seqs and the source seqs; (2) the ref seqs and the reads. For each alignment, run:

  • RAxML with rapid bootstrap, PROTGAMMAWAG model
    • Should we try with JTT + CAT model for consistency with FastTree?
    • Sam: I don't know much about WAG versus JTT. Does anyone else? In RAxML, CAT is a hacked, fast substitute for the Gamma model for rate variation. Is that what it means in FastTree, too?
    • Sam: If we do rapid ML search with RAxML, even specifying the PROTGAMMAWAG model, it uses the CAT model for part of the search and then the GTR Gamma model for final optimization. I think doing this or a regular ML search under PROTGAMMAXXX is probably the right thing.
  • FastTree
    • Steve: FastTree v.2.0. Command: FastTree -pseudo alignmentName
    • FastTree only has one model for peptides: JTT+CAT (BLOSUM distance).
  • RAxML with iterative placement of reads
    • Steve: RAxML 7.2.2. Command: TBA. Can use the PROTGAMMAWAG model. Could try JTT + CAT as well
  • pplacer with posterior probabilities computed
    • Neither iterative RAxML nor pplacer resolves polytomies of reads -- we will see if this is a problem. There may be some options in terms of how to combine iterative with regular RAxML to get a faster method that works well for us.

Tree Build Log

If you are checking out a dataset for phylogenetic processing, indicate who you are, the data you are working with, the tree building parameters you will process, and the start and stop datetimes.

NOTICE: To run RAxML, you must convert FASTA to Phylip format, preferably the relaxed Phylip version to preserve header information, but there are funny characters in the sequence headers which cause RAxML to break. For now, I'm using a BioPerl script (on edhar: /home/sharpton/scripts/ to generate a strict phylip formatted alignment from the FASTA files. I am appending .phy to the end of alingment filenames so *.aln becomes *.aln.phy. A second script will be used to post-correct header names on the phylogenetic tree. I am still using FASTA files for FastTree input.

  • 2009-12-11 - TJS and SJR
    • Tom ran FastTree on all alignment files in all subdirectories of all three main directories: sims-RefDBs-rpoB-n20-m5, sims-RefDBs-rpoB-n20-m10, sims-RefDBs-rpoB-n20-m15. FastTree crashed on several of the bigger files (but not all) in each directory. We'll try to figure out why this is happening, probably after the annual meeting.
    • We are also trying to run the bootstrapped RAxML (on just a few files, see below), but this is taking so long (even with rapid bootstrap) that we decided to try running non-bootstrapped RAxML on all alignment files in all subdirectories of the three main directories.
  • 2009-12-10 - TJS - Checking out the following directory
    • big directory: sims-RefDBs-rpoB-n20-m5/sims-rpoB-n500m18r2000q-1p400u20/
      • sim1 subdirectory: sim1-rpoB-n500m18r2000q-1p400u20/
        • alignment files in the sim1 subdirectory:
          • sim1-reads-rpoB-sr-RefDB.aln
          • sim1-src-final-RefDB.aln
        • tree build parameters
          • RAxML v7.2.5 alpha: raxmlHPC -f a -x 12345 -# 100 -s FILEXXX.aln.phy -n FILEXXX.raxml.tre -m PROTCATJTT -w OUTPUT_SUBDIR_PATH/
          • FastTree v.2.1.0: FastTree -pseudo FILEXXX.aln > FILEXXX.ft.pseudo.tree

Comparing trees

  • Again, if you do any of these items, record exactly which software you use to do it.

For each pair of trees, corresponding to the pair of alignments: (1) convert the labels of the reads in one tree to the labels of the corresponding source sequences; (2) compair the two trees, which now have the same labels, according to the following measures:

  • Robinson-Foulds
    • Steve: I can run this in R/phangorn.
  • Branch-length distance -- Take the square root of the sum of the square of the difference in the lengths of corresponding branches.
  • Compute the leaf-leaf distances for each tree and then compute RMS error of these for two trees
    • Steve: I can run this in R/phangorn
  • Compute the minimum number of taxa pruned to obtain a Maximum Agreement Subtree.

Other things to do/consider:

  • We may also want to think about comparing the trees of (source + ref) sequences built by different algorithms, in order to get a handle on how different these are.
  • How do we use the bootstrap values? We could think about doing a weighted Robinson-Foulds, where the weights on the edges correspond to their bootstrap values. This could also be combined with the branch-length distance.