User:R. Eric Collins/GenomicsTutorial/Genomics/Selection

Overview
The purpose of this exercise is to become familiar with the following:

Concepts

 * Genetic similarity/Evolutionary relatedness
 * Homology = shared ancestry (usually inferred from sequence similarity)
 * Orthology = separated by speciation event
 * Paralogy = separated by gene duplication event


 * Selection (at gene level)
 * Neutral = no effect on fitness
 * Negative (purifying) = change lowers fitness
 * Positive = change raises fitness

Techniques

 * Multiple sequence alignment
 * Calculating dN/dS ratio (nonsynonymous/synonymous changes)
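
To make the synonymous/nonsynonymous distinction concrete, here is a minimal Python sketch that counts the two kinds of codon differences between a pair of aligned coding sequences. It is not the method used by HyPhy/Datamonkey: it only tallies raw differences under the standard genetic code and does not normalize by the number of synonymous and nonsynonymous sites or correct for multiple substitutions. The function name and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codon_differences(seq_a, seq_b):
    """Count synonymous and nonsynonymous codon differences between two aligned sequences.

    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a) - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: cannot be classified
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical toy sequences: codon 2 changes CTT -> CTC (Leu -> Leu, synonymous),
# codon 3 changes GAT -> GGT (Asp -> Gly, nonsynonymous).
print(count_codon_differences("ATGCTTGATAAA", "ATGCTCGGTAAA"))  # (1, 1)
```

An excess of nonsynonymous over synonymous change at a site is the signal that the site-specific methods later in this exercise test for statistically.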

Software/Databases

 * KEGG (genomics database)
 * KEGG Pathways (metabolic pathways database)
 * IMG (genomics and metagenomics database)
 * Jalview (software for sequence alignment editing and tree viewing)
 * HyPhy/Datamonkey (software for detecting selection)

Choose a biomolecule and pathway

 * 1) Go to KEGG Pathways
 * 2) Enter your favorite biomolecule (e.g. sulfate)
 * 3) Find a pathway of interest involving your biomolecule (e.g. Sulfur Metabolism), click on map image
 * 4) This shows a map of metabolic reactions and the Enzyme Commission (EC) numbers of the enzymes that mediate them. Click on "Reference Pathway", change it to "Reference Pathway (KO)" and click "Go". This will show which parts of the pathway are represented by genes from existing complete genome sequences.
 * 5) Click on "Pathway Entry" at the top. This will go to a page with summary information about the pathway. From here you can obtain a list of all the organisms that have entries in the pathway ("All Organisms" button), and a list of which genes are in which organism ("Ortholog Table" button).

Get sequences of gene of interest

 * 1) Select a gene of interest (under "Orthology") by clicking the KEGG Orthology (KO) link (e.g. K00394), or go back to the map to select an enzyme mediating a reaction of interest
 * 2) Note the KEGG Orthology number of your gene of interest
 * 3) Go to IMG
 * 4) Click "Find Functions", enter your KO number and select "KEGG Orthology ID" from the dropdown list
 * 5) If you want to limit your search to certain taxonomic groups (e.g. Deltaproteobacteria), select them below (use Tree view to select large groups). Otherwise all Complete and Draft genomes from all 3 Domains will be searched
 * 6) Click the number under "Gene Count" if you get results
 * 7) The gene list can be further filtered, e.g. by searching for a specific genus (e.g. Desulfovibrio)
 * 8) Click the Select box next to the genes you want to keep and click "Add Selected Genes to Cart"
 * 9) From the Gene Cart a number of analyses can be performed, including genomic neighborhood alignments, export of protein or nucleic acid sequences, and sequence alignments.

Do multiple sequence alignment

 * 1) Select "DNA" under "Sequence Alignment" and click "Do Alignment"
 * 2) Click "Launch Jalview"
 * 3) Remove any misaligned sequences and correct any obvious mistakes in the alignment, e.g. gaps that are not multiples of 3 nucleotides (1 codon)
 * 4) Select "File --> Output to textbox --> FASTA". Copy and paste into a text editor (e.g. Notepad, TextWrangler) and save to disk under a sensible name
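
The gap check in step 3 can also be done on the saved FASTA file. The following is a minimal Python sketch, assuming a plain FASTA alignment with `-` as the gap character; the function names and the example alignment are hypothetical. It only flags gap runs whose length is not a multiple of 3; it does not check whether a gap starts on a codon boundary.

```python
import re


def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def find_frame_breaking_gaps(records):
    """Return (name, start, length) for each gap run whose length is not a multiple of 3."""
    problems = []
    for name, seq in records.items():
        for match in re.finditer(r"-+", seq):
            length = match.end() - match.start()
            if length % 3 != 0:
                problems.append((name, match.start(), length))
    return problems


# Hypothetical two-sequence alignment: seq1 has a 3-nucleotide gap (fine),
# seq2 has a 2-nucleotide gap that would shift the reading frame.
example = """>seq1
ATGCTT---GATAAA
>seq2
ATGCT--TCGGTAAA
"""
records = parse_fasta(example)
print(find_frame_breaking_gaps(records))  # [('seq2', 5, 2)]
```

Reported start positions are 0-based offsets into the aligned sequence, so add 1 to find the corresponding column in Jalview.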

Conduct search for site-specific selection

 * 1) Open the saved FASTA file in the text editor. Remove the stop codon (TAG, TAA, or TGA) from the end of each sequence if one is present
 * 2) Rename sequences to something sensible, save, close file
 * 3) Go to http://www.datamonkey.org/dataupload.php
 * 4) Choose your file, select "Codon" data, "Universal" code and click "Upload"
 * 5) If the results look alright, click "Proceed to Analysis Menu"
 * Precomputed results for all analyses
 * 1) "Execute" an automatic model selection tool.
 * 2) Run SLAC, FEL, and REL methods using selected model
 * 3) Explore results using Integrative Selection Analysis
 * 4) If 3D structure is known, sites under selection may be visualized, e.g. Crystal structure of adenylylsulfate reductase from Desulfovibrio gigas from NCBI Structure or Protein Data Bank
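
Steps 1 and 2 of this section (removing terminal stop codons and renaming sequences) can be scripted instead of done by hand. The following is a minimal Python sketch under two assumptions that are not part of the original instructions: the terminal stop codon is replaced with gap characters rather than deleted, so that all sequences keep the same aligned length, and names are cleaned by replacing anything other than letters and digits with underscores. The function names and example header are hypothetical. Stop codons inside a sequence are not handled.

```python
import re

STOP_CODONS = {"TAG", "TAA", "TGA"}


def mask_terminal_stop(seq):
    """Replace a stop codon at the end of an aligned sequence with gaps.

    Trailing gap characters are ignored when looking for the final codon.
    """
    body = seq.rstrip("-")
    trailing = seq[len(body):]
    if len(body) >= 3 and body[-3:].upper() in STOP_CODONS:
        body = body[:-3] + "---"
    return body + trailing


def simplify_name(header):
    """Turn a FASTA header into a short name made of letters, digits, and underscores."""
    return re.sub(r"[^A-Za-z0-9]+", "_", header).strip("_")


print(mask_terminal_stop("ATGCTTGATTAA"))  # ATGCTTGAT---
print(mask_terminal_stop("ATGCTTTGA---"))  # ATGCTT------
print(simplify_name("Desulfovibrio vulgaris aprA"))  # Desulfovibrio_vulgaris_aprA
```

Masking with gaps rather than deleting is a design choice made here to avoid producing sequences of unequal length; check the upload summary to confirm the file was read as expected.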