20.109(S14):Phylogenetic and primer analyses (Day7): Difference between revisions

 
(31 intermediate revisions by 2 users not shown)
Line 9: Line 9:
The goals of phylogenetics are 1) to reconstruct the correct genealogical relationships among organisms, genes, or sequences and 2) to estimate their divergence since sharing a common ancestor. The process of phylogenetic reconstruction relies heavily on correct comparison of the traits in question, whether they are morphological data (such as wing lengths) or sequence data. For sequence data, comparison is made by aligning a set of orthologous sequences, which we will do in lab using the 16S rRNA gene.
The goals of phylogenetics are 1) to reconstruct the correct genealogical relationships among organisms, genes, or sequences and 2) to estimate their divergence since sharing a common ancestor. The process of phylogenetic reconstruction relies heavily on correct comparison of the traits in question, whether they are morphological data (such as wing lengths) or sequence data. For sequence data, comparison is made by aligning a set of orthologous sequences, which we will do in lab using the 16S rRNA gene.


Today, we have a choice of algorithms (distance-based, neighbor-joining, parsimony, likelihood, and others) for reconstructing a phylogenetic tree that depicts the relationships among aligned sequences. A number of models are also available for defining how the mutations between sequences (genetic substitutions) are assessed. Each of these methods and models has advantages and disadvantages, which are closely considered (ideally!) in any formal published phylogenetics study. In the world of microbial community analysis, a popular choice is the neighbor-joining method (Saitou and Nei, 1987), which is one of the methods that deal most accurately and consistently with large data sets. Regardless of the method chosen, however, the result -- a reconstructed phylogenetic tree -- has proven to be an extremely useful qualitative and often even quantitative tool for examining the relationships among organisms.
Today, we have a choice of algorithms (distance-based, neighbor-joining, parsimony, likelihood, and others) for reconstructing a phylogenetic tree that depicts the relationships among aligned sequences. A number of models are also available for defining how the mutations between sequences (genetic substitutions) are assessed. Each of these methods and models has advantages and disadvantages, which are closely considered (ideally!) in any formal published phylogenetics study. In the world of microbial community analysis, a popular choice is the neighbor-joining method (Saitou and Nei, 1987), which is one of the methods that deal most accurately and consistently with large data sets. Regardless of the method chosen, however, the result – a reconstructed phylogenetic tree – has proven to be an extremely useful qualitative and often even quantitative tool for examining the relationships among organisms.
 
EXPAND
 
Somewheres here? Reconnect to goal. Or earlier?:
We might ask: If two microbiomes are phylogenetically different, but functionally equivalent, does that mean they will be susceptible or resistant to similar pathogens? What do differences in microbiome structure mean for a bird’s ability to carry influenza virus, microsporidia, giardia species, or other gull associated microbes?


==Protocols==
==Protocols==
Line 23: Line 18:
====Overview====
====Overview====


You will take several steps to analyze your bird stool sequencing data, first with your partner, and ultimately across the entire class:
You will take several steps to analyze your bird stool sequencing data, first alone, then with the other two people assigned to your same sample, and ultimately across the entire class:


* For each ### clone of yours (e.g., #716-1 through #716-8) and your partner's (e.g., #716-9 through #716-16), you will trim and combine the forward and reverse sequencing results to get one intact 16S rRNA gene sequence.
* For each ### clone of yours (e.g., #716-1 through #716-8) and your partner's (e.g., #716-9 through #716-16 or #455-1 through #455-8), you will trim and combine the forward and reverse sequencing results to get one intact 16S rRNA gene sequence.
* For each sequence, you will use BLAST to determine the closest known bacterial species to that sequence.  
* For each sequence, you will use BLAST to determine the closest known bacterial species to that sequence.  
* Along with your partner, you will post the sequences and a summary of the species that you found, according to a specific template.
* Along with your partner (if you were assigned the same bird sample) or alone (otherwise), you will post the sequences and a summary of the species that you found, according to a specific template.
* You should then align all your robust sequences, up to 16 of them, in a program called MEGA, and subsequently construct a phylogenetic tree.  
* You should then team up with the other two people assigned to your same sample. Together, you will align all of your robust sequences, up to 24 of them, in a program called MEGA, and subsequently construct a phylogenetic tree.  
**'''Each team in each section''' must post an interim or complete alignment file for one sample ###.
**'''Each group of three''' must post an interim or complete alignment file and tree for one sample ###.
**T/R section files will necessarily be incomplete.
**The T/R section files for #252 will necessarily be incomplete, and will be added to by the W/F section.
**W/F morning and afternoon sections will then add their sequences to the alignment file begun for their particular ###.
* These trees will be used to make cross-class comparisons, along with composite trees for each bird population. (See assignment description for further guidance.)
* These trees will be posted so that cross-class comparisons can be made.
** The trees can largely be compared by inspection, but you may optionally run a UniFrac analysis.
**T/R will post provisional trees to show progress.
**W/F morning section will post interim or complete trees, depending on the sample.
**W/F afternoon section will post complete trees that everyone can share.
* Finally, you may compare the MA versus AK trees by inspection, as some of them will be pretty homogeneous, or you may optionally run a UniFrac analysis. (Some guidance about UniFrac will be posted later this week.) You might also want to compare composite trees for MA (up to 16x4 sequences) versus AK.


====Part A: Understand possible insert orientations within vector====
====Part A: Understand possible insert orientations within vector====
Line 52: Line 43:
#Choose the "Login" link and then use "astachow@mit.edu" and "be20109" to log in.  
#Choose the "Login" link and then use "astachow@mit.edu" and "be20109" to log in.  
#At the bottom right should be a section called ''Recent Results''. Click on ''More'' to expand it, and then click the icon under the Results column for your particular plate.
#At the bottom right should be a section called ''Recent Results''. Click on ''More'' to expand it, and then click the icon under the Results column for your particular plate.
#*T/R orders were placed on x/xx, and W/F orders were placed on y/yy.
#*T/R orders were placed on 02/26, and W/F orders were placed on 02/28.
#The quickest way to start working with a particular sequence is to follow the "View" link under the ''Seq File'' heading. For ambiguous data, you may want to look directly at the ''Trace File'' as well.
#The quickest way to start working with a particular sequence is to follow the "View" link under the ''Seq File'' heading. For ambiguous data, you may want to look directly at the ''Trace File'' as well.


====Part C: Prepare sequences for analysis====
====Part C: Prepare sequences for analysis====


#Begin by downloading [[Media:20109_pSC-B.gb | this file]], which contains the DNA sequence of the vector we are using in GenBank format. Open the file in ApE (''A plasmid Editor'', created by M. Wayne Davis at the University of Utah), which is found on your desktop. Three items of interested are highlighted: the forward priming site, the reverse priming site and the two basepairs between which your sequence should be inserted.
#Begin by downloading [[Media:PCRBlunt_20109.ape| '''this file''']], which contains the DNA sequence of the vector we are using in GenBank format. Open the file in ApE (''A plasmid Editor'', created by M. Wayne Davis at the University of Utah), which is found on your desktop. Three items of interest are highlighted: the forward priming site, the reverse priming site and the two basepairs between which your sequence should be inserted.
#Follow the steps below for each clone that had successful forward and reverse sequencing reactions. In cases where only one reaction was successful, briefly check whether you can locate an insert. However, note that there is a known problem with this cloning procedure wherein sometimes an incomplete vector (with no insert and also missing a chunk of the vector) is returned. You should also scroll down to the bottom to check if any of your failed reactions were repeated; these are noted with an "R" and in some cases worked the second time around.   
#Follow the steps below for each clone that had successful forward and reverse sequencing reactions. In cases where only one reaction was successful, briefly check whether you can locate an insert. You should also scroll down to the bottom of the Genewiz table to check if any of your failed reactions were repeated; these are noted with an "R" and in some cases worked the second time around.   
#Paste the forward sequence of your first candidate into a new ApE file. Locate where the vector ends and the insert begins; trim away the vector.
#Paste the forward sequence of your first candidate into a new ApE file. Locate where the vector ends and the insert begins; trim away the vector.
#*While it is easiest to find the insert by doing ''Edit'' → ''Find'' (or Apple-F) using the base pairs right before the insert should begin, note that the string "CCC" may be mis-sequenced as "CC" or "CCCC" because long stretches of the same base (particularly Gs and Cs) are prone to error.
#*It may be easiest to find the insert by doing ''Edit'' → ''Find'' (or Apple-F) using the base pairs right before the insert should begin.
#Paste the reverse sequence of your first candidate into yet another ApE file. Immediately use ''Edit'' → ''Reverse Complement'' to adjust the sequence, and again trim away the vector.
#Paste the reverse sequence of your first candidate into yet another ApE file. Immediately use ''Edit'' → ''Reverse Complement'' to adjust the sequence, and again trim away the vector.
#*Why is it more convenient to work with the reverse complement when sequencing from the reverse direction?
#*''Why is it more convenient to work with the reverse complement when sequencing from the reverse direction?''
#In ApE, use ''Tools'' → ''Align Sequence'' to find where the forward and reverse sequences overlap. Combine them into one sequence with no repeated parts; where both forward and reverse sequence have coverage of the gene, choose whatever combination has the fewest Ns (ideally none!). Save this sequence as a new file called YourTeamDay-YourTeamColor_YourSampleID-"C"Candidate Number (e.g., WF-Purple_737-C1).
#In ApE, use ''Tools'' → ''Align Sequence'' to find where the forward and reverse sequences overlap. Combine them into one sequence with no repeated parts; where both forward and reverse sequences have coverage of the gene, choose whatever combination has the fewest unknown bases, or Ns (ideally none!).
#*You may find it easiest to print out the alignment in order to choose where to switch from using forward to using reverse sequence.
#*You may find it easiest to print out the alignment and mark up the hardcopy in order to choose where to switch from using forward to using reverse sequence. Let the base-pair numbers be your guides.
#*In pilot testing, we have run into one case in which the forward and reverse sequences have almost no overlap. It's not clear what caused this error. Before assuming that this error has struck your data, too, be sure that you reverse-complemented your reverse sequence!
#*Be aware that long stretches of the same base (particularly Gs and Cs) are prone to error; for example, the string "CCC" may be mis-sequenced as "CC" or "CCCC."
#Finally, depending on the orientation of your insert, you may want to reverse complement the entire sequence. Use the original sequences of the forward and reverse 16S primers to guide your decision.
#Save this sequence as a new file called YourTeamDayYourTeamColorYourSampleID"C"CandidateNumber (e.g., WFPurple737C1).
#You must now save each sequence in .txt format. If anyone can figure out how to do this task directly in ApE, let us know! Otherwise, you can copy-paste the sequence into a program such as TextEdit, choose ''File'' → Save, and in the pulldown menu select ''Plain Text''.
#Finally, depending on the orientation of your insert, you may want to reverse complement the entire sequence. Use the original sequences of the forward and reverse 16S primers to guide your decision.  
#*<font color=FF33FF>'''It is important for subsequent alignment that all sequences are 5' to 3' (begin with AGA).'''</font>
#You must now save each sequence in .txt format. If anyone can figure out how to do this task directly in ApE, let us know! Otherwise, you can copy-paste the sequence into a program such as TextEdit, choose ''File'' &rarr; ''Save'', and in the pulldown menu select ''Plain Text''.
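
The overlap-and-merge step above can be sketched in Python. This is only a minimal illustration and not how ApE works: the function names <code>reverse_complement</code> and <code>merge_reads</code>, the <code>min_overlap</code> parameter, and the toy sequence are all assumptions made for this example. The merge also requires an exact overlap, whereas real reads often contain Ns or miscalled bases that you will still need to resolve by eye.

```python
# Complement table for the bases that appear in sequencing output.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def merge_reads(forward, reverse_rc, min_overlap=10):
    """Join two vector-trimmed reads that share an exact overlap.

    forward    -- the forward read, 5' to 3'
    reverse_rc -- the reverse read AFTER reverse-complementing it
    Returns the merged sequence, or None when no exact overlap of at
    least min_overlap bases exists (hypothetical threshold).
    """
    longest = min(len(forward), len(reverse_rc))
    # Try the longest possible overlap first, then shrink.
    for size in range(longest, min_overlap - 1, -1):
        if forward[-size:] == reverse_rc[:size]:
            return forward + reverse_rc[size:]
    return None


# Toy insert, made up for illustration only.
insert = "AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAATACGT"
forward_read = insert[:30]                         # covers the 5' end
reverse_read = reverse_complement(insert[16:])     # as it comes off the sequencer

merged = merge_reads(forward_read, reverse_complement(reverse_read))
print(merged == insert)
```

If <code>merge_reads</code> returns <code>None</code>, the first thing to check is whether the reverse read was actually reverse-complemented before merging.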


====Part D: Identify species from sequences====
====Part D: Identify species from sequences====
Line 75: Line 68:
# Under ''Choose Search Set'', select "16S ribosomal RNA sequences (Bacteria and Archaea)" from the ''Database'' pulldown menu.
# Under ''Choose Search Set'', select "16S ribosomal RNA sequences (Bacteria and Archaea)" from the ''Database'' pulldown menu.
# Click on the BLAST button. Matches will be shown by vertical lines between the aligned sequences; mismatches will lack a vertical line, and gaps will be shown with a dash.
# Click on the BLAST button. Matches will be shown by vertical lines between the aligned sequences; mismatches will lack a vertical line, and gaps will be shown with a dash.
# Because this gene is highly conserved, a number of species should come up as highly matched. However, one should (usually) be a best choice. Using the [[Media: S13-M1D7-template.xlsx | linked template]], write down this strain and its accession number, its associated max score, query coverage, max identity, gaps, mismatches, and full taxonomy; write down these parameters for the second most closely matched species as well. The taxonomy information can be found by clicking on the accession number and looking under the "organism" heading.  
#Because this gene is highly conserved, a number of species should come up as highly matched. However, one should (usually) be a best choice. Think carefully here rather than blindly accepting the top species listed.
#*For example, if a partial sequence for species A comes up as the top choice, a full sequence for species B comes up as the second choice, and a full sequence for species A is the third most closely matched choice, is species A or B truly closer to your original sequence?
#When you have decided which is best, use the [[Media: S14-M1D7-template.xlsx | '''linked template''']] to document this strain and its accession number, its associated max score, query coverage, max identity, gaps, mismatches, and full taxonomy; write down these parameters for the second most closely matched species as well. The taxonomy information can be found by clicking on the accession number and looking under the "organism" heading.  
#*Taxonomy order is kingdom, phylum, class, order, family, genus, and species.   
#*Taxonomy order is kingdom, phylum, class, order, family, genus, and species.   
# When a particular clone is very closely matched to two different species, you might choose to define it at a higher order, such as genus or family. When a particular clone is not well-matched to any known species (perhaps representing an unidentified or undocumented species), you might also choose to define it at a higher order when submitting this information in the phylogenetics program.
# When a particular clone is very closely matched to two different species, you might choose to define it at a higher order, such as genus or family. When a particular clone is not well-matched to any known species (perhaps representing an unidentified or undocumented species), you might also choose to define it at a higher order when submitting this information in the phylogenetics program.
# Be sure to rename the candidates according to your section day, team color, and clone number.  
# Be sure to rename the Excel file according to your section day, team color, and sample ID number.
# Please post all of your .txt files (up to 8) and also your Excel file to the table on today's Talk page when you have finished.
# Please post all of your .txt files (up to 8 per person) and also your Excel file to the table on today's Talk page when you have finished.


====Part E: Align sequences and construct tree====
====Part E: Align sequences and construct tree====
Line 85: Line 80:
For this next part you will use freely available software called Molecular Evolutionary Genetics Analysis, or MEGA. Feel free to read additional information about this software at the [http://www.megasoftware.net MEGA website]. What you need should already be downloaded on your laboratory computers, or you can download it onto your personal computers if you wish.
For this next part you will use freely available software called Molecular Evolutionary Genetics Analysis, or MEGA. Feel free to read additional information about this software at the [http://www.megasoftware.net MEGA website]. What you need should already be downloaded on your laboratory computers, or you can download it onto your personal computers if you wish.


You are welcome to get together with your clone ### partner for this next part, or if you really want to you can work alone from each other's text files and Excel write-ups -- but that's twice the work!
<font color=FF33FF>You may find it easiest to work immediately in your teams of three (with same bird sample ID) at this stage. Or, if you are at very different stages of the work, you may sequentially prepare and combine alignment files.</font>
 
<font color=red>
 
Updates for W/F section:
 
* When adding your sequences to an existing T/R alignment file, you may need to first ''Insert Blank Sequence'', and then copy-paste into that slot.
* Some T/R ### pairs finished entirely, and in other cases only one person out of the pair was able to finish aligning their sequences. You can add to either a full or a half T/R alignment file, but just keep track of what data you are using. Be sure also to post your own .mas file, which the other T/R person can then add to and re-post.


</font color>
'''Important note for W/F Team Silver or any teams working in a staggered fashion:''' When adding your sequences to an existing alignment file, you may need to first ''Insert Blank Sequence'', and then copy-paste into that slot.


#Open MEGA. In the upper left corner, click on the icon labeled ''Align'', and choose ''Edit/Build Alignment'' from the pulldown menu. This selection should open the Alignment Explorer. When you are prompted, choose "DNA" alignment of course.
#Open MEGA. In the upper left corner, click on the icon labeled ''Align'', and choose ''Edit/Build Alignment'' from the pulldown menu. This selection should open the Alignment Explorer. When you are prompted, choose "DNA" alignment of course.
#Under ''Edit'', choose ''Insert Sequence from File'' and select your first .txt file. It should appear in the explorer.  
#Under ''Edit'', choose ''Insert Sequence from File'' and select your first .txt file. It should appear in the explorer.  
#Double-click to rename according to the species. Note that each sequence must have a unique name. Thus, it is best that you name according to both species and clone: for example, "Klebsiella oxytoca (TR-Blu-4B)". This approach will also allow us to track which sequences came from which individual preps, which might be useful information.
#Double-click to rename according to the species. Note that each sequence must have a unique name. Thus, it is best that you name according to both species and clone: for example, "Klebsiella.oxytoca.TRBlu.4". This approach will also allow us to track which sequences came from which individual preps, which might be useful information.
#*Please use the following 3-letter abbreviations for your colors: Red, Org, Ylw, Grn, Blu, Pnk, Prp, Plt. "B" indicates the right-hand partner in the sequencing plate.
#*Please use the following 3-letter abbreviations for your colors: Red, Org, Ylw, Grn, Blu, Pnk, Prp, Sil, Wht.  
#When you have input all sequences from your section (up to 16), choose ''Edit'' &rarr; ''Select All'', followed by ''Alignment'' &rarr; ''Align by Clustal-W''.
#*<font color=FF33FF>Use periods, NOT spaces, in the species name, in order to facilitate potential UniFrac analysis later on.</font>
#Now choose ''Data'' &rarr; ''Save Session'' and name the alignment according to section and clone (such as "TR-716-alignment"). Post this file on today's Talk page. That way W/F data can readily be combined with T/R data into one-tree, per gull sample, by using Open Saved Alignment Session followed by copy and paste.
#When you have input all sequences from your three-person team (up to 24), choose ''Edit'' &rarr; ''Select All'', followed by ''Alignment'' &rarr; ''Align by Clustal-W''.
#*To be clear, T/R will post 16-sequence alignment files, and W/F will post 32-sequence alignment files. Ditto for trees.
#Now choose ''Data'' &rarr; ''Save Session'' and name the alignment according to section and clone (such as "TR-716-alignment"). Post this file on today's Talk page.
#Under ''Data'', choose ''Phylogenetic Analysis''. ''When prompted, should you answer that the DNA is protein-coding or not protein-coding?''
#*If you are posting a partial alignment file for any sample, add "-partial" at the end of the filename. That way additional data can readily be combined into one tree per gull sample, by using ''Open Saved Alignment Session'' followed by copy and paste.
#Under ''Data'', choose ''Phylogenetic Analysis''.  
#*''When prompted, should you answer that the DNA is protein-coding or not protein-coding?''
#Now leave Alignment Explorer and go back to the original MEGA window.
#Now leave Alignment Explorer and go back to the original MEGA window.
#From the ''Phylogeny'' icon pulldown menu, select ''Construct/Test Neighbor-Joining Tree''. To proceed, click on ''Compute''.
#From the ''Phylogeny'' icon pulldown menu, select ''Construct/Test Neighbor-Joining Tree''. To proceed, click on ''Compute''.
#Finally, choose ''Image'' &rarr; ''Save as PDF File'' to document your tree. Save according to section and sample number as before.
#Finally, choose ''Image'' &rarr; ''Save as PDF File'' to document your tree. Save according to section and sample number as before.
#Please post the trees on today's Talk page, so we can see that everyone had the chance to do tree analysis on their own.
#Please post the trees on today's Talk page for class-wide use.
#*T/R section will post trees representing T/R data only.
#*W/F section will post trees that include both T/R and W/F sequences.
#In larger groups, for example at Thursday office hours, we can construct trees wherein all data from a given region are combined.
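
To see what a distance-based method such as neighbor-joining starts from, here is a hedged Python sketch that computes pairwise p-distances (the fraction of compared sites that differ) from already-aligned sequences. The p-distance is only one of the simplest possible measures and may not be the model MEGA applies by default; the function names, the choice to skip gap and N columns, and the toy sequences are assumptions for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ("-") or an unknown base
    ("N") are skipped (an assumption of this sketch).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(alignment):
    """Map each pair of sequence names to its p-distance."""
    return {
        (name_a, name_b): p_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


# Toy alignment; names follow the period-separated convention above,
# but the sequences themselves are made up.
alignment = {
    "Klebsiella.oxytoca.TRBlu.4": "AGAGTTTGAT",
    "Klebsiella.oxytoca.WFRed.2": "AGAGTCTGAT",
    "Unknown.genus.TRBlu.5": "AGA-TTTGNT",
}

for pair, dist in distance_matrix(alignment).items():
    print(pair, dist)
```

Because the dictionary keys must be unique, this structure also mirrors the requirement that every sequence in the alignment has a unique name.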


====Part F: Compare sets of trees====
====Part F: Compare sets of trees====


Guidance to come in a few days...
Most of the guidance about your data analysis is found in the assignment description [[20.109%28S14%29:Microbiome_summary |'''linked here'''.]]
 
However, before you leave today it is worth doing one check: Are there any supposedly identical species that show up on different leaves? If so, are the different leaves correlated with different team names? If so, please ask those teams to double-check that all their sequence files are facing in the correct direction! There are other reasons that identical species can show up on different leaves, however. As you think about why, remember that MEGA is using your sequence information (and only your sequence information) directly in its algorithm.
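
The direction check described above can be sketched as a small Python script. It assumes each sequence is a plain string keyed by its name and uses the "begin with AGA" rule from Part C; the function names, the three labels, and the toy sequences are hypothetical and chosen only for this example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def check_orientation(sequences, start="AGA"):
    """Label each named sequence by whether it appears to read 5' to 3'.

    "ok"      -- already begins with the expected bases
    "flip"    -- its reverse complement begins with the expected bases
    "unclear" -- neither; inspect the trace file by hand
    """
    report = {}
    for name, seq in sequences.items():
        seq = seq.upper()
        if seq.startswith(start):
            report[name] = "ok"
        elif reverse_complement(seq).startswith(start):
            report[name] = "flip"
        else:
            report[name] = "unclear"
    return report


# Toy sequences, made up for illustration.
toy = {
    "clone1": "AGAGTTTGATCC",
    "clone2": reverse_complement("AGAGTTTGATCC"),
    "clone3": "TTTTCCCC",
}
print(check_orientation(toy))
```

A sequence labeled "unclear" is not necessarily wrong, since the first few bases may simply have been mis-sequenced or trimmed differently.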


===Part 2: Microsporidia primer analysis===
===Part 2: Microsporidia primer analysis===


The 60 PCR samples were labeled numerically, and the associated sample definitions are included in the [[Media: S13-M1D7-PCR-calcs.xlsx | attached file]]. Each group should have three consecutive reactions; note that #1-12 are reference samples (with V1-PMP2 primer set) that will be added to the gel by the teaching faculty.
The 60 PCR samples were labeled numerically, and the associated sample definitions are included in the [[Media: S14-M1D7-PCR-calcs.xlsx | '''attached file''']]. Each group should have three consecutive reactions; note that #1-12 are reference samples (with V1-PMP2 primer set) that will be added to the gel by the teaching faculty.
 
Sample preparation: mix by pipetting, take 20 &mu;L, add 4 &mu;L loading dye, then load 21 &mu;L onto gel '''with your P20'''.


Each gel will have the following general structure for sample loading:
Each gel will have the following general structure for sample loading:
Line 125: Line 116:
{| border="1"
{| border="1"
! Lane  
! Lane  
!Sample (20 &mu;L)
!Sample (21 &mu;L)
!Lane
!Lane
!Sample (20 &mu;L)
!Sample (21 &mu;L)
|-
|-
| 1  
| 1  
Line 170: Line 161:
| T/R 1  
| T/R 1  
| Specificity (VC, EH, mixture)
| Specificity (VC, EH, mixture)
| Red
| Orange
| Orange
| Yellow
|-
|-
| T/R 2
| T/R 2
| Specificity (VC, EH, mixture)
| Specificity (VC, EH, mixture)
| Blue
| Blue
| Purple
| W/F Green
|-
|-
| T/R 3
| T/R 3
| Sensitivity (VC: lo, mid, hi)
| Sensitivity (EH: lo, mid, hi)
| Yellow
| Red
| Green
| Green
|-
|-
| T/R 4
| T/R 4
| Sensitivity (VC: lo, mid, hi)
| Sensitivity (EH: lo, mid, hi)
| Pink
| Pink
| Plat runs W/F Red!
| Blue
|-
|-
| W/F 1  
| W/F 1  
| Specificity (VC, EH, mixture)
| Specificity (VC, EH, mixture)
| Orange
| Blue
| Green
| Purple
|-
|-
| W/F 2
| W/F 2
| Specificity (VC, EH, mixture)
| Specificity (VC, EH, mixture)
| Blue
| Silver
| Pink
| White
|-
|-
| W/F 3
| W/F 3
| Sensitivity (EH: lo, mid, hi)
| Sensitivity (EH: lo, mid, hi)
| Yellow
| Red
| Purple
| Orange
|-
|-
| W/F 4
| W/F 4
| Sensitivity (EH: lo, mid, hi)
| Sensitivity (EH: lo, mid, hi)
| Platinum
| Yellow
| Red runs T/R Platinum!
| Pink
|-
|-
|}
|}
</center>
</center>
<br style="clear:both;"/>


Sample preparation: mix by pipetting, take 20 &mu;L, add 2.5 &mu;L loading dye, then load 20 &mu;L onto gel.
<font color=FF33FF>'''Note: Due to a miscalculation of how much polymerase we had left, only the specificity gels will be run today. The teaching faculty will run and post the sensitivity gels by Friday mid-day.'''</font>
 
<br style="clear:both;"/>


==For next time==
==For next time==


Some of you have journal clubs next time. No other required homework is due on Day 8.
Some of you have journal clubs next time. No other required homework is due on Day 8.
#The following bonus assignment may be submitted on Day 8: Prepare a figure and caption for your [[20.109%28S13%29:_Primer_design_summary | primer design summary]] that shows your raw PCR results -- the agarose gel. (Later you might decide to process this data in some way, but not necessarily.) Write an early draft of the accompanying main text paragraph.


==Reagent list==
==Reagent list==
Line 236: Line 224:
**1 mM EDTA, pH 8.3
**1 mM EDTA, pH 8.3
*100 bp DNA ladder from New England BioLabs
*100 bp DNA ladder from New England BioLabs
==Navigation Links==
Next Day: [[20.109(S14):Journal club II (Day8)| Journal club II]]
Previous Day: [[20.109(S14):Journal club I (Day6)| Journal club I]]

Latest revision as of 07:33, 4 March 2014


20.109(S14): Laboratory Fundamentals of Biological Engineering


Introduction

Molecular phylogeny, or phylogenetics, is used to study relationships among organisms. The most common approach these days involves examining nucleic acid sequences or protein data from specific genetic loci; frequently the goal is to define data down to the species level. All life forms on earth trace back to a few organisms that lived billions of years ago and all share a common descent. Groups of organisms that are closely related to each other diverged from more recent shared common ancestors. Phylogeny remains one of the few effective means of describing these relationships, which can be difficult to assess by other means.

The goals of phylogenetics are 1) to reconstruct the correct genealogical relationships among organisms, genes, or sequences and 2) to estimate their divergence since sharing a common ancestor. The process of phylogenetic reconstruction relies heavily on correct comparison of the traits in question, whether they are morphological data (such as wing lengths) or sequence data. For sequence data, comparison is made by aligning a set of orthologous sequences, which we will do in lab using the 16S rRNA gene.

Today, we have a choice of algorithms (distance-based, neighbor-joining, parsimony, likelihood, and others) for reconstructing a phylogenetic tree that depicts the relationships among aligned sequences. A number of models are also available for defining how the mutations between sequences (genetic substitutions) are assessed. Each of these methods and models has advantages and disadvantages, which are closely considered (ideally!) in any formal published phylogenetics study. In the world of microbial community analysis, a popular choice is the neighbor-joining method (Saitou and Nei, 1987), which is one of the methods that deal most accurately and consistently with large data sets. Regardless of the method chosen, however, the result – a reconstructed phylogenetic tree – has proven to be an extremely useful qualitative and often even quantitative tool for examining the relationships among organisms.

Protocols

Part 1: Bird microbiome analysis

Overview

You will take several steps to analyze your bird stool sequencing data, first alone, then with the other two people assigned to your same sample, and ultimately across the entire class:

  • For each ### clone of yours (e.g., #716-1 through #716-8) and your partner's (e.g., #716-9 through #716-16 or #455-1 through #455-8), you will trim and combine the forward and reverse sequencing results to get one intact 16S rRNA gene sequence.
  • For each sequence, you will use BLAST to determine the closest known bacterial species to that sequence.
  • Along with your partner (if you were assigned the same bird sample) or alone (otherwise), you will post the sequences and a summary of the species that you found, according to a specific template.
  • You should then team up with the other two people assigned to your same sample. Together, you will align all of your robust sequences, up to 24 of them, in a program called MEGA, and subsequently construct a phylogenetic tree.
    • Each group of three must post an interim or complete alignment file and tree for one sample ###.
    • The T/R section files for #252 will necessarily be incomplete, and will be added to by the W/F section.
  • These trees will be used to make cross-class comparisons, along with composite trees for each bird population. (See assignment description for further guidance.)
    • The trees can largely be compared by inspection, but you may optionally run a UniFrac analysis.

Part A: Understand possible insert orientations within vector

  1. Recall from Day 1 the sequences of the forward and reverse primers used to broadly amplify bacterial 16S rRNA gene segments:
    • Forward: 5' AGAGTTTGATCCTGGCTCAG
    • Reverse: 5' ACGGGCGGTGTGTACA
  2. Based on these sequences, you might expect that your insert will always begin with "AGA" and always end with "CGT". (Draw a picture to make sure you understand why the last three bases are as they are written here.)
  3. However, in blunt-end cloning, the insert – here our PCR product – can face in either orientation. Take a moment to figure out what other basepairs you might expect to see at the beginning or end of your sequenced insert.
    • The kind of cloning we are doing is called non-directional cloning. Directional cloning is possible when, for example, two different restriction enzymes are used to create overhangs that are complementary to the vector but not to each other.
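
Once you have drawn your picture, you can check it with a short script. The following is a minimal sketch, not part of the lab software: the helper names `reverse_complement` and `insert_ends` are made up for illustration, and the only inputs are the two primer sequences from step 1.

```python
# Primer sequences as given in step 1 (both written 5' to 3').
FORWARD_PRIMER = "AGAGTTTGATCCTGGCTCAG"
REVERSE_PRIMER = "ACGGGCGGTGTGTACA"

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def insert_ends(first_primer, second_primer):
    """First and last three bases of the strand that starts with first_primer.

    The strand begins with first_primer itself and ends with the reverse
    complement of second_primer, which anneals to the opposite strand.
    """
    return first_primer[:3], reverse_complement(second_primer)[-3:]


if __name__ == "__main__":
    print("Orientation from step 2:", insert_ends(FORWARD_PRIMER, REVERSE_PRIMER))
    print("Opposite orientation:   ", insert_ends(REVERSE_PRIMER, FORWARD_PRIMER))
```

Run it only after making your own prediction; the two printed pairs are the start and end triplets for each possible orientation of the insert.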

Part B: How to download a sequence

  1. The data from Genewiz is available at the company website, linked here.
  2. Choose the "Login" link and then use "astachow@mit.edu" and "be20109" to log in.
  3. At the bottom right should be a section called Recent Results. Click on More to expand it, and then click the icon under the Results column for your particular plate.
    • T/R orders were placed on 02/26, and W/F orders were placed on 02/28.
  4. The quickest way to start working with a particular sequence is to follow the "View" link under the Seq File heading. For ambiguous data, you may want to look directly at the Trace File as well.

Part C: Prepare sequences for analysis

  1. Begin by downloading this file, which contains the DNA sequence of the vector we are using in GenBank format. Open the file in ApE (A plasmid Editor, created by M. Wayne Davis at the University of Utah), which is found on your desktop. Three items of interest are highlighted: the forward priming site, the reverse priming site and the two basepairs between which your sequence should be inserted.
  2. Follow the steps below for each clone that had successful forward and reverse sequencing reactions. In cases where only one reaction was successful, briefly check whether you can locate an insert. You should also scroll down to the bottom of the Genewiz table to check if any of your failed reactions were repeated; these are noted with an "R" and in some cases worked the second time around.
  3. Paste the forward sequence of your first candidate into a new ApE file. Locate where the vector ends and the insert begins; trim away the vector.
    • It may be easiest to find the insert by using Edit → Find (or Apple-F) to search for the base pairs right before the point where the insert should begin.
  4. Paste the reverse sequence of your first candidate into yet another ApE file. Immediately use Edit → Reverse Complement to adjust the sequence, and again trim away the vector.
    • Why is it more convenient to work with the reverse complement when sequencing from the reverse direction?
  5. In ApE, use Tools → Align Sequence to find where the forward and reverse sequences overlap. Combine them into one sequence with no repeated parts; where both forward and reverse sequences cover the gene, choose whatever combination has the fewest unknown bases, or Ns (ideally none!).
    • You may find it easiest to print out the alignment and mark up the hardcopy in order to choose where to switch from using forward to using reverse sequence. Let the base-pair numbers be your guides.
    • Be aware that long stretches of the same base (particularly Gs and Cs) are prone to error; for example, the string "CCC" may be mis-sequenced as "CC" or "CCCC".
  6. Save this sequence as a new file named with your team day, team color, and sample ID, followed by "C" and the candidate number (e.g., WFPurple737C1).
  7. Finally, depending on the orientation of your insert, you may want to reverse complement the entire sequence. Use the original sequences of the forward and reverse 16S primers to guide your decision.
    • It is important for subsequent alignment that all sequences are 5' to 3' (begin with AGA).
  8. You must now save each sequence in .txt format. If anyone can figure out how to do this task directly in ApE, let us know! Otherwise, you can copy-paste the sequence into a program such as TextEdit, choose File → Save, and in the pulldown menu select Plain Text.
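
The merge in step 5 can also be sketched in code. This is a simplified illustration, not a replacement for the ApE alignment: it assumes both reads have already been trimmed of vector, that the reverse read has already been reverse complemented, and that the two reads agree exactly in their overlap apart from Ns. Real reads may contain mismatches or missing bases that this sketch does not handle. The function names and the `min_overlap` parameter are hypothetical.

```python
def find_overlap(forward, reverse_rc, min_overlap=20):
    """Length of the longest end of `forward` that matches the start of
    `reverse_rc`, treating N as compatible with any base. Returns 0 if
    no overlap of at least `min_overlap` bases is found."""
    longest = min(len(forward), len(reverse_rc))
    for size in range(longest, min_overlap - 1, -1):
        tail, head = forward[-size:], reverse_rc[:size]
        if all(a == b or "N" in (a, b) for a, b in zip(tail, head)):
            return size
    return 0


def merge_reads(forward, reverse_rc, min_overlap=20):
    """Combine two overlapping reads into one sequence with no repeated part."""
    forward, reverse_rc = forward.upper(), reverse_rc.upper()
    size = find_overlap(forward, reverse_rc, min_overlap)
    if size == 0:
        raise ValueError("no overlap found; check trimming and orientation")
    tail, head = forward[-size:], reverse_rc[:size]
    # Within the overlap, prefer a called base over an N at each position.
    overlap = "".join(b if a == "N" else a for a, b in zip(tail, head))
    return forward[:-size] + overlap + reverse_rc[size:]


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    fwd = "AGAGTTTGATCCTGGNTCAGATTG"
    rev_rc = "GGCTCAGATTGAACGCTGGCGG"
    merged = merge_reads(fwd, rev_rc, min_overlap=8)
    print(merged, "Ns remaining:", merged.count("N"))
```

Because the sketch accepts only exact agreement, a `ValueError` on real data usually means the reads need the manual, base-by-base comparison described in step 5.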

Part D: Identify species from sequences

  1. The "nucleotide BLAST" alignment program can be accessed through the NCBI BLAST page or directly from this link. Follow the steps below for each clone, one at a time.
  2. Paste the sequence text that you prepared above into the "Query" box. If there were ambiguous areas in your sequencing results, these will be listed as "N" rather than "A", "T", "G", or "C"; it's fine to include Ns in the query.
  3. Under Choose Search Set, select "16S ribosomal RNA sequences (Bacteria and Archaea)" from the Database pulldown menu.
  4. Click on the BLAST button. In each alignment, matches will be shown by vertical lines between the aligned sequences, mismatches by the absence of a line, and gaps by dashes.
  5. Because this gene is highly conserved, a number of species should come up as highly matched. However, one should (usually) be a best choice. Think carefully here rather than blindly accepting the top species listed.
    • For example, if a partial sequence for species A comes up as the top choice, a full sequence for species B comes up as the second choice, and a full sequence for species A is the third most closely matched choice, is species A or B truly closer to your original sequence?
  6. When you have decided which is best, use the linked template to document this strain and its accession number, its associated max score, query coverage, max identity, gaps, mismatches, and full taxonomy; write down these parameters for the second most closely matched species as well. The taxonomy information can be found by clicking on the accession number and looking under the "organism" heading.
    • Taxonomy order is kingdom, phylum, class, order, family, genus, and species.
  7. When a particular clone is very closely matched to two different species, you might choose to define it at a higher taxonomic rank, such as genus or family. When a particular clone is not well-matched to any known species (perhaps representing an unidentified or undocumented species), you might also choose to define it at a higher rank when submitting this information in the phylogenetics program.
  8. Be sure to rename the Excel file according to your section day, team color, and sample ID number.
  9. Please post all of your .txt files (up to 8 per person) and also your Excel file to the table on today's Talk page when you have finished.
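
The comparison asked for in step 5 can be made explicit in a few lines. The sketch below shows one possible heuristic, not the course's required method: it keeps each species' most complete hit, ranks by query coverage and then identity, and flags near-ties that might be better reported at a higher rank (step 7). The `Hit` record, the function names, the `tolerance` value, and all numbers are invented for illustration; in practice you would copy the values from your own BLAST results table.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    species: str
    query_coverage: float  # percent, from the BLAST results table
    identity: float        # percent, from the BLAST results table


def rank_species(hits):
    """Keep each species' most complete hit, then rank the species."""
    def key(hit):
        return (hit.query_coverage, hit.identity)

    best = {}
    for hit in hits:
        current = best.get(hit.species)
        if current is None or key(hit) > key(current):
            best[hit.species] = hit
    return sorted(best.values(), key=key, reverse=True)


def is_ambiguous(ranked, tolerance=0.5):
    """True if the top two species are within `tolerance` percentage points
    of each other in both coverage and identity."""
    if len(ranked) < 2:
        return False
    top, runner_up = ranked[0], ranked[1]
    return (abs(top.query_coverage - runner_up.query_coverage) <= tolerance
            and abs(top.identity - runner_up.identity) <= tolerance)


if __name__ == "__main__":
    # Invented values mirroring the partial-versus-full example above.
    hits = [
        Hit("Species A", 62.0, 99.8),   # partial sequence
        Hit("Species B", 100.0, 97.1),  # full sequence
        Hit("Species A", 100.0, 96.5),  # full sequence
    ]
    ranked = rank_species(hits)
    print([h.species for h in ranked], "ambiguous:", is_ambiguous(ranked))
```

A ranking like this is only a starting point; you should still look at the alignments themselves before settling on a species.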

Part E: Align sequences and construct tree

For this next part you will use freely available software called Molecular Evolutionary Genetics Analysis, or MEGA. Feel free to read additional information about this software at the MEGA website. What you need should already be downloaded on your laboratory computers, or you can download it onto your personal computers if you wish.

You may find it easiest to work immediately in your teams of three (with same bird sample ID) at this stage. Or, if you are at very different stages of the work, you may sequentially prepare and combine alignment files.

Important note for W/F Team Silver or any teams working in a staggered fashion: When adding your sequences to an existing alignment file, you may need to first Insert Blank Sequence, and then copy-paste into that slot.

  1. Open MEGA. In the upper left corner, click on the icon labeled Align, and choose Edit/Build Alignment from the pulldown menu. This selection should open the Alignment Explorer. When you are prompted, choose "DNA" alignment of course.
  2. Under Edit, choose Insert Sequence from File and select your first .txt file. It should appear in the explorer.
  3. Double-click to rename according to the species. Note that each sequence must have a unique name. Thus, it is best that you name according to both species and clone: for example, "Klebsiella.oxytoca.TRBlu.4". This approach will also allow us to track which sequences came from which individual preps, which might be useful information.
    • Please use the following 3-letter abbreviations for your colors: Red, Org, Ylw, Grn, Blu, Pnk, Prp, Sil, Wht.
    • Use periods, NOT spaces, in the species name, in order to facilitate potential UniFrac analysis later on.
  4. When you have input all sequences from your three-person team (up to 24), choose Edit → Select All, followed by Alignment → Align by Clustal-W.
  5. Now choose Data → Save Session and name the alignment according to section and sample number (such as "TR-716-alignment"). Post this file on today's Talk page.
    • If you are posting a partial alignment file for any sample, add "-partial" at the end of the filename. That way additional data can readily be combined into one tree per gull sample, by using Open Saved Alignment Session followed by copy and paste.
  6. Under Data, choose Phylogenetic Analysis.
    • When prompted, should you answer that the DNA is protein-coding or not protein-coding?
  7. Now leave Alignment Explorer and go back to the original MEGA window.
  8. From the Phylogeny icon pulldown menu, select Construct/Test Neighbor-Joining Tree. To proceed, click on Compute.
  9. Finally, choose Image → Save as PDF File to document your tree. Save according to section and sample number as before.
  10. Please post the trees on today's Talk page for class-wide use.
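
Neighbor-joining builds its tree from a table of pairwise distances between the aligned sequences. As a rough illustration of what such a table contains, the sketch below computes the simplest possible distance, the proportion of differing sites (p-distance), skipping any column where either sequence has a gap or an N. This is an assumption-laden toy: MEGA applies its own substitution model and gap handling, so its numbers will generally differ. The function names and the example sequences are invented.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue  # skip gaps and unknown bases
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_table(aligned):
    """Pairwise p-distances for a dict of {sequence name: aligned sequence}."""
    return {
        (name_a, name_b): p_distance(aligned[name_a], aligned[name_b])
        for name_a, name_b in combinations(aligned, 2)
    }


if __name__ == "__main__":
    # Toy alignment; names follow the species.team.clone pattern from step 3.
    aligned = {
        "Genus.one.TRBlu.1": "ACGT-ACGTA",
        "Genus.one.TRRed.2": "ACGTTACGTT",
        "Genus.two.TRBlu.3": "ACGTTACNTA",
    }
    for pair, distance in distance_table(aligned).items():
        print(pair, round(distance, 3))
```

Sequences separated by small distances end up on neighboring leaves, which is why a single sequence entered in the wrong orientation can land far from its true relatives.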

Part F: Compare sets of trees

Most of the guidance about your data analysis is found in the assignment description linked here.

However, before you leave today it is worth doing one check: Are there any supposedly identical species that show up on different leaves? If so, are the different leaves correlated with different team names? If so, please ask those teams to double-check that all their sequence files are facing in the correct direction! There are other reasons that identical species can show up on different leaves, however. As you think about why, remember that MEGA is using your sequence information (and only your sequence information) directly in its algorithm.
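
The orientation check described above can be partly automated. The sketch below is a minimal illustration under stated assumptions: sequence names follow the species.team.clone pattern from Part E, each trimmed sequence still begins with one of the two 16S primers from Part A, and the first 8 bases are enough to tell them apart. The function names and the `probe` length are hypothetical.

```python
from collections import defaultdict

FORWARD_PRIMER = "AGAGTTTGATCCTGGCTCAG"
REVERSE_PRIMER = "ACGGGCGGTGTGTACA"


def orientation(seq, probe=8):
    """Classify a sequence by which primer its first `probe` bases match."""
    seq = seq.upper()
    if seq.startswith(FORWARD_PRIMER[:probe]):
        return "forward"
    if seq.startswith(REVERSE_PRIMER[:probe]):
        return "flipped"
    return "unclear"


def flag_by_team(named_sequences):
    """Group every sequence that is not clearly forward-facing by team.

    The team is read from the second-to-last period-separated field of the
    name, e.g. "TRBlu" in "Klebsiella.oxytoca.TRBlu.4".
    """
    flagged = defaultdict(list)
    for name, seq in named_sequences.items():
        team = name.split(".")[-2]
        status = orientation(seq)
        if status != "forward":
            flagged[team].append((name, status))
    return dict(flagged)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    sequences = {
        "Klebsiella.oxytoca.TRBlu.4": "AGAGTTTGATCCTGGCTCAGATTGAACG",
        "Klebsiella.oxytoca.TRRed.2": "ACGGGCGGTGTGTACAAGGCCCGGGAAC",
    }
    print(flag_by_team(sequences))
```

A sequence reported as "unclear" is not necessarily wrong; it may simply have been trimmed differently or contain an N near its start, so inspect those by eye.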

Part 2: Microsporidia primer analysis

The 60 PCR samples were labeled numerically, and the associated sample definitions are included in the attached file. Each group should have three consecutive reactions; note that #1-12 are reference samples (with V1-PMP2 primer set) that will be added to the gel by the teaching faculty.

Sample preparation: mix by pipetting, take 20 μL, add 4 μL loading dye, then load 21 μL onto gel with your P20.
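
The volumes above follow from the 6X dye stock named in the reagent list: 4 μL of 6X dye in a 24 μL mixture gives 1X. A minimal sketch of that arithmetic, with a hypothetical helper name, in case you need to scale the volumes:

```python
def final_dye_strength(sample_ul, dye_ul, stock_x=6.0):
    """Dye concentration (in X) after mixing sample with dye stock."""
    return stock_x * dye_ul / (sample_ul + dye_ul)


if __name__ == "__main__":
    total_ul = 20 + 4
    print("Dye strength:", final_dye_strength(20, 4), "X")
    print("Fraction of mixture loaded:", 21 / total_ul)
```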

Each gel will have the following general structure for sample loading:

Lane  Sample (21 μL)            Lane  Sample (21 μL)
1     Group 1, sample 1         6     V1-PMP2, sample 2
2     Group 1, sample 2         7     V1-PMP2, sample 3
3     Group 1, sample 3         8     Group 2, sample 1
4     DNA ladder (load 10 μL)   9     Group 2, sample 2
5     V1-PMP2, sample 1         10    Group 2, sample 3


It's essential that the correct reference sample be run on each gel, and therefore that groups requiring the same reference sample pair up. Let the table below be your guide:

Gel number  Reference samples              Group 1  Group 2
T/R 1       Specificity (VC, EH, mixture)  Orange   Yellow
T/R 2       Specificity (VC, EH, mixture)  Blue     W/F Green
T/R 3       Sensitivity (EH: lo, mid, hi)  Red      Green
T/R 4       Sensitivity (EH: lo, mid, hi)  Pink     Blue
W/F 1       Specificity (VC, EH, mixture)  Blue     Purple
W/F 2       Specificity (VC, EH, mixture)  Silver   White
W/F 3       Sensitivity (EH: lo, mid, hi)  Red      Orange
W/F 4       Sensitivity (EH: lo, mid, hi)  Yellow   Pink


Note: Due to a miscalculation of how much polymerase we had left, only the specificity gels will be run today. The teaching faculty will run and post the sensitivity gels by Friday mid-day.

For next time

Some of you have journal clubs next time. No other required homework is due on Day 8.

Reagent list

  • Mostly your brains!
  • Agarose gels
    • 2:1 mixture of high-resolution:standard agarose
    • Prepared in TAE buffer
    • With SYBR Safe stain (Invitrogen)
      • Used at the manufacturer's recommended concentration (10,000-fold dilution)
  • NEB loading dye (6X stock)
  • Gels made and run in 1X TAE buffer
    • 40 mM Tris
    • 20 mM Acetic Acid
    • 1 mM EDTA, pH 8.3
  • 100 bp DNA ladder from New England BioLabs
