Wilke:Molecular Evolution

From OpenWetWare

The Basics

Evolutionary rate (dN/dS) analyses follow a basic pipeline: sequence alignment, phylogenetic inference, and finally evolutionary rate inference. When dealing with protein-coding sequences, always align the amino acid sequences so that codons are preserved. Then back-translate the alignment into nucleotide data, which the final step in the pipeline requires. Phylogenies may be built from either amino acid or nucleotide data, although an amino acid tree may be slightly more accurate. A Python script can back-translate your data, given an unaligned nucleotide file and an aligned amino acid file.
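As an illustration of the back-translation step, here is a minimal sketch (not the lab's own script). It assumes that each nucleotide sequence has exactly three bases per aligned residue (stop codons already removed), that gaps are written as "-", and that sequence names match between the two files. The function names are hypothetical, and the sketch does not check that each codon actually translates to the aligned residue.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def back_translate(aligned_aa, unaligned_nt, gap="-"):
    """Thread one nucleotide sequence onto its aligned amino acid sequence."""
    n_residues = sum(1 for ch in aligned_aa if ch != gap)
    if len(unaligned_nt) != 3 * n_residues:
        raise ValueError("nucleotide length is not 3x the number of residues")
    codons = []
    pos = 0
    for ch in aligned_aa:
        if ch == gap:
            codons.append(gap * 3)  # one amino acid gap becomes three gaps
        else:
            codons.append(unaligned_nt[pos:pos + 3])
            pos += 3
    return "".join(codons)


def back_translate_alignment(aa_alignment, nt_sequences):
    """Back-translate every sequence in an amino acid alignment."""
    return {name: back_translate(aa, nt_sequences[name])
            for name, aa in aa_alignment.items()}


# Small in-memory example: one gap column in the protein alignment.
aa = read_fasta(">seq1\nM-K\n>seq2\nMLK\n")
nt = read_fasta(">seq1\nATGAAA\n>seq2\nATGCTGAAA\n")
codon_alignment = back_translate_alignment(aa, nt)
```

Because every amino acid gap expands to exactly three nucleotide gaps, the reading frame is preserved in the resulting codon alignment.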

Note also that a minimum of 10 sequences is recommended to achieve well-supported results in an evolutionary rate analysis.

Sequence Alignment

Commonly used aligners include mafft, muscle, and prank. While prank is probably the most accurate, it is also very time-consuming. We recommend using mafft for sequence alignments.

To align sequences in mafft, we additionally recommend using the "--auto" option, which lets mafft select an appropriate alignment strategy based on the size of your data set. Mafft accepts a variety of file formats, including fasta and phylip (sequential and/or interleaved).

The infile should contain unaligned amino acid sequences, and aligned sequences will be sent to the outfile name provided.

mafft --auto infile > outfile
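The same call can be scripted when many files need aligning. The following is a minimal sketch, assuming that the mafft executable is on your PATH; the function names are hypothetical. Mafft writes the alignment to standard output, so the sketch redirects that stream to the outfile.

```python
import subprocess


def mafft_command(infile, executable="mafft"):
    """Build the argument list for the recommended mafft call."""
    return [executable, "--auto", infile]


def run_mafft(infile, outfile, executable="mafft"):
    """Run mafft and write the alignment to outfile.

    check=True raises CalledProcessError if mafft exits with an error.
    """
    with open(outfile, "w") as handle:
        subprocess.run(mafft_command(infile, executable),
                       stdout=handle, check=True)
```

For example, run_mafft("proteins.fasta", "proteins_aligned.fasta") would be equivalent to the command above (both file names are placeholders).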

Phylogenetic Inference

Several options exist for creating a phylogeny. One can infer either a maximum likelihood or a Bayesian tree. Accuracy is comparable between these two methods, but maximum likelihood inference may be faster. We recommend using the software RAxML for maximum likelihood phylogenetic inference. If you have a large number of sequences (>500), we recommend using FastTree, which employs maximum likelihood techniques but is significantly faster.
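The choice between the two programs can be made automatically from the size of the alignment. This is a small sketch of that rule of thumb; the function names are hypothetical, and the cutoff of 500 sequences is the one suggested above.

```python
def count_fasta_records(text):
    """Count sequences in FASTA-formatted text (one '>' header each)."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def suggest_tree_program(n_sequences, large_cutoff=500):
    """Return 'FastTree' for large data sets and 'RAxML' otherwise."""
    return "FastTree" if n_sequences > large_cutoff else "RAxML"


# Example with a tiny in-memory alignment of three sequences.
example = ">a\nMK-L\n>b\nMKAL\n>c\nMRAL\n"
program = suggest_tree_program(count_fasta_records(example))
```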

RAxML Usage

  • Requires input alignment in phylip format
  • A model must always be specified using the -m option. While many options are available, for nucleotide data we recommend "GTRGAMMA" and for amino acid data "PROTGAMMAJTT" or "PROTGAMMAWAG". The following examples will all use GTRGAMMA, but note that other models are possible.
  • Basic inference, yielding a single tree inference.
raxmlHPC -m GTRGAMMA -s <infile> -n <outfile>
  • "Quick and dirty" inference, using RAxML's rapid bootstrapping algorithm (-f a). -# signifies the number of rapid bootstrap replicates, and -x and -p take random number seeds (any number will do). RAxML searches for the best-scoring tree and applies the cursory bootstrap confidence values to it.
raxmlHPC -f a -m GTRGAMMA -s <infile> -n <outfile> -x <random_num_seed> -p <random_num_seed> -# 100
  • "Slow and thorough" inference. This analysis takes place in three stages: (a) make a certain number of inferences, (b) determine bootstrap values, and (c) apply the bootstrap values to the best tree inference made in step (a). -# signifies the number of inferences or bootstraps. Give each run its own name with -n, because RAxML will not start if output files with the same run name already exist in the working directory.

Conduct 100 inferences; the best tree is written to RAxML_bestTree.<inference_name>:

raxmlHPC -m GTRGAMMA -s <input_alignment.phy> -# 100 -n <inference_name>

Conduct 100 bootstrap replicates, providing a random number seed with the -b option (any number will do); the replicate trees are written to RAxML_bootstrap.<bootstrap_name>:

raxmlHPC -m GTRGAMMA -s <input_alignment.phy> -# 100 -b <random_num_seed> -n <bootstrap_name>

Apply bootstrap values to the best tree inference:

raxmlHPC -f b -m GTRGAMMA -s <input_alignment.phy> -z <file_with_bootstraps> -t <file_with_bestTree> -n <final_output>
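The three stages can also be assembled in a script. The sketch below only builds the command lines; the function name and the "_ml", "_boot", and "_final" suffixes are hypothetical. It makes two assumptions that go beyond the commands above: each run gets its own -n name (RAxML refuses to start when output files with the same run name already exist), and a -p seed is passed to the first two stages, which recent RAxML releases expect.

```python
def raxml_thorough_commands(alignment, name, seed=12345, replicates=100,
                            model="GTRGAMMA", executable="raxmlHPC"):
    """Return the three command lines of the slow and thorough analysis."""
    ml_name = name + "_ml"      # run name for the tree inferences
    boot_name = name + "_boot"  # run name for the bootstrap replicates
    common = [executable, "-m", model, "-s", alignment, "-p", str(seed)]

    # (a) multiple inferences; best tree goes to RAxML_bestTree.<ml_name>
    inferences = common + ["-#", str(replicates), "-n", ml_name]

    # (b) bootstrap replicates; trees go to RAxML_bootstrap.<boot_name>
    bootstraps = common + ["-#", str(replicates), "-b", str(seed),
                           "-n", boot_name]

    # (c) draw the bootstrap support values on the best tree
    annotate = [executable, "-f", "b", "-m", model, "-s", alignment,
                "-z", "RAxML_bootstrap." + boot_name,
                "-t", "RAxML_bestTree." + ml_name,
                "-n", name + "_final"]
    return [inferences, bootstraps, annotate]


commands = raxml_thorough_commands("alignment.phy", "run1")
```

Each list can then be passed in order to subprocess.run(command, check=True), so that a failure in an earlier stage stops the analysis before the later stages run.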

FastTree Usage

FastTree accepts alignments in fasta or interleaved phylip format. It automatically assumes that your data are protein sequences. If you are creating a phylogeny with nucleotide data, provide the option -nt in your command, and add -gtr to use the GTR model. By default, support values are included on your tree. To remove support values, include the option -nosupport in your command.

  • Rapid basic inference
FastTree <inputfile> > <outputfile>
  • Thorough, a bit slower, but likely more accurate inference
FastTree -gamma -mlacc 2 -slownni -spr 4 <inputfile> > <outputfile>

  • Alternative that also works
FastTree -gamma -slow <inputfile> > <outputfile>
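The FastTree variants above can be wrapped in one helper. This is a minimal sketch with hypothetical function names, assuming that the FastTree executable is on your PATH. For nucleotide data it passes -nt (FastTree's flag for nucleotide input) together with -gtr.

```python
import subprocess


def fasttree_command(infile, nucleotide=False, thorough=False,
                     executable="FastTree"):
    """Build a FastTree argument list for protein or nucleotide data."""
    cmd = [executable]
    if nucleotide:
        cmd += ["-nt", "-gtr"]  # nucleotide alignment with the GTR model
    if thorough:
        # Slower settings from the thorough example above.
        cmd += ["-gamma", "-mlacc", "2", "-slownni", "-spr", "4"]
    cmd.append(infile)
    return cmd


def run_fasttree(infile, outfile, **options):
    """Run FastTree; the tree is printed to stdout, so redirect it."""
    with open(outfile, "w") as handle:
        subprocess.run(fasttree_command(infile, **options),
                       stdout=handle, check=True)
```

For example, run_fasttree("codons.fasta", "codons.tre", nucleotide=True, thorough=True) would run the thorough settings on a nucleotide alignment (both file names are placeholders).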