BIOL478/S20:SARS-CoV-2 Phylogeny Exercise

From OpenWetWare

This exercise is due at the beginning of class (1:00 PM) on Monday, April 6. Please e-mail a Word document to Dr. Dahlquist.


This exercise is based on:

Data & Resources

  1. MN908947: Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1
  2. MN996532: Bat coronavirus RaTG13
  3. AY278741: SARS coronavirus Urbani
  4. KY417146: Bat SARS-like coronavirus isolate Rs4231
  5. MK211376: Coronavirus BtRs-BetaCoV/YN2018B
  • Only raw sequence reads are available for the pangolin coronavirus sequences (consensus generated from SRR10168377 and SRR10168378, NCBI BioProject PRJNA573298)
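
As an optional alternative to downloading each record by hand, the accessions listed above could also be retrieved with a short script. The sketch below assumes the public NCBI E-utilities `efetch` endpoint and the `nuccore` database; the helper names `efetch_url` and `download_fasta` are made up for illustration and are not part of the assignment. The script only prints the request URLs; nothing is downloaded unless `download_fasta` is called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession numbers copied from the Data & Resources list.
ACCESSIONS = ["MN908947", "MN996532", "AY278741", "KY417146", "MK211376"]


def efetch_url(accession, db="nuccore"):
    """Build the URL that requests one record as plain-text FASTA."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def download_fasta(accession, db="nuccore"):
    """Fetch one record over the network and return the FASTA text."""
    with urlopen(efetch_url(accession, db)) as response:
        return response.read().decode("utf-8")


# Show the request URLs without contacting the server.
for accession in ACCESSIONS:
    print(efetch_url(accession))
```

Calling `download_fasta("MN908947")` would return the FASTA text for the SARS-CoV-2 reference genome, which could then be written to a file. The exercise itself still expects you to work through the GenBank web pages as described in Part 1.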

Part 1: GenBank

In this section you will take a closer look at a GenBank record and the type of data that is stored there. Once you reach the nucleic acid data associated with the Andersen et al. paper you will see that there are a variety of different ways to view the data.

Choose one of the GenBank records from the Data & Resources section above and view both the full record and the FASTA formatted sequence.

  • What was the accession number of the sequence you chose?
  • What information is provided in the GenBank record?
  • Download each of the sequences in the Data & Resources section in FASTA format to your local hard drive.
    • Click the "Send to" link in the upper right of the page. Select "Complete Record", "File" as the Destination, and "FASTA" as the format. Click the "Create File" button. Be careful to remember where you put the file and what you name it so that you can find it later.
  • Open the file that you saved in a word processor to confirm that you have the sequences and that they are in the FASTA format. In the FASTA format, each sequence is preceded by a label line that begins with the greater-than sign (>). For example, the SARS-CoV-2 record begins with this label line, which is followed by the lines of sequence:
>MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
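
The check described above (each sequence preceded by a label line that starts with ">") can also be done with a few lines of code. This is a minimal sketch, not a full FASTA parser; the function name `read_fasta` and the two example records are hypothetical and are not real GenBank data.

```python
def read_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new label line closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        elif label is None:
            raise ValueError("sequence data found before the first '>' label")
        else:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# Hypothetical example text, for illustration only.
example = """>seq1 hypothetical example record
ATGTTTGTT
TTTCTTGTT
>seq2 another hypothetical record
ATGAAA
"""

records = read_fasta(example)
for label, sequence in records:
    # Print the identifier (first word of the label) and the sequence length.
    print(label.split()[0], len(sequence))
```

To check a downloaded file, read its contents into a string and pass that string to `read_fasta`; the number of records returned should match the number of sequences you saved.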

Part 2: Creating a phylogenetic tree with

In order to analyze sequence data, we will use a free, simple-to-use web service dedicated to reconstructing and analysing phylogenetic relationships between molecular sequences.

  1. While we could create a phylogenetic tree with the entire genome sequence of the viruses, we are mainly interested in the spike protein. Links have been provided to the individual spike protein sequence records in the Data & Resources section. Download the five protein sequences in FASTA format, just like you did for the whole genome sequence.
  2. In your browser, go to the website. Scroll down on the page to the section labeled ‘Phylogeny analysis’, and click on the text ‘One Click’.
  3. Click in the large text field labeled ‘Upload your set of sequences in FASTA, EMBL, or NEXUS format’. Use Ctrl-V to paste your sequences here, then click the “Submit” button.
  4. You will see a page named ‘Alignment results’. After your alignment is complete, you will see a new page named ‘Phylogeny results’. Finally, you will see a page named ‘Tree rendering results’. You will come back to these pages later. For now, find the numbered tabs located just beneath the text ‘One Click Mode’, and click on the tab labeled ‘3. Alignment’.
    • Within the alignment, individual positions are color-coded to indicate their conservation, or how similar the sequences are to each other at that position. Blue highlighting indicates high conservation (i.e., the sequences are identical or at least very similar), while gray highlighting indicates lower conservation and white highlighting indicates little if any conservation.
  5. Near the bottom of the page, under ‘Outputs’, click on ‘Alignment in Clustal format’. This will display your alignment in a text-only format in which each position’s conservation is indicated by a symbol underneath the alignment block (“*” for invariant, “:” for highly conserved, “.” for weakly conserved, and a space for not conserved). Copy and paste this entire alignment into your Word document.
    • Hint: After you paste it in, change the font to "Courier" or "Courier New". This is a fixed-width font and will allow all the letters to line up properly.
  6. Now go back and click on the tab ‘6. Tree Rendering’, and you will see a phylogenetic tree of the five sequences.
    • On this tree, horizontal lines (branches) represent individual evolutionary lineages. By contrast, vertical lines (splits) represent divergence events, where one lineage splits into two, and the vertical length of each split is drawn purely for visual clarity with no biological meaning. The left-most split is called the root of the tree, and represents a hypothesis about the most recent common ancestor (MRCA) of the sequences within your tree.
    • The length of each branch represents the percentage change in amino acid sequence occurring along that branch, relative to the scale bar shown at the bottom of the tree. The scale bar will be a number between 0 and 1 and can be reinterpreted as a percent. For example, 0.05 would be 5%. The tree may also contain support values for each clade, shown in red on the branches. These are also expressed as a number between 0 and 1 that can be read as a percent in the same way. In general, a higher support value indicates a higher statistical confidence in a particular clade.
    • Copy the image and paste it into your Word document.
  7. Compare the tree to the multiple sequence alignment. See if you can relate the differences in the sequences to the topology of the tree diagram.
  8. Relate your alignment to Figure 1 of the Andersen et al. (2020) paper. In your alignment, find the amino acid sequences that are highlighted in the figure and mentioned in the text. Change the font formatting of these sequences (for example, bold or a different color) in your Word document to show that you found them. For our next assignment, we will look at the structure of the spike protein in more detail to understand how certain amino acids are used to attach the virus to the host cell.
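
To make the conservation symbols from step 5 and the comparison in step 7 more concrete, the sketch below marks each alignment column with "*", ":", "." or a space, and computes a crude fraction of differing positions between two aligned sequences. Several things here are assumptions: the "strong" and "weak" amino acid groups are the ones commonly documented for Clustal programs and may not match what the web service uses exactly; the function names are made up for illustration; and the toy sequences are hypothetical, not spike protein data. The pairwise fraction is only a rough stand-in for branch lengths, which the tree-building software estimates with a statistical model.

```python
# Amino acid groups as commonly documented for Clustal (assumed here).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return the conservation symbol for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # columns containing a gap are treated as not conserved
    if len(residues) == 1:
        return "*"  # invariant
    if any(residues <= set(group) for group in STRONG):
        return ":"  # highly conserved
    if any(residues <= set(group) for group in WEAK):
        return "."  # weakly conserved
    return " "      # not conserved


def conservation_line(aligned):
    """Return one symbol per column for a list of equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(column_symbol(col) for col in zip(*aligned))


def fraction_different(a, b):
    """Fraction of positions that differ, skipping positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical toy alignment, for illustration only.
toy = {"A": "MKVLSTAW", "B": "MKILSSAD", "C": "MRVL-TGP"}

for name, seq in toy.items():
    print(f"{name}  {seq}")
print(f"   {conservation_line(list(toy.values()))}")
print("A vs B:", fraction_different(toy["A"], toy["B"]))
```

Sequences that differ at fewer positions would generally be expected to sit closer together on the tree, which is one way to relate the alignment to the tree topology.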

Part 3: Prepare for looking at the structure next week



  • Make a list of at least 10 biological terms for which you did not know the definitions when you first read the article.
  • Define each of the terms.
    • You can use the glossary in any molecular biology, cell biology, or genetics textbook as a source for definitions, or you can use one of many available online biological dictionaries (some of which are listed below).
  • For each definition, give a properly APA-formatted citation for the definition.


  • Write an outline of the article. The length should be 2-3 pages. Your outline can be in any form you choose. The text of the outline does not have to be complete sentences, but it should answer the questions listed below and have enough information so that others can follow it. However, your outline should be in YOUR OWN WORDS, not copied straight from the article. It is not acceptable to copy another student's outline either. Even if you work together to understand the article, your individual outlines need to be in your own words.
    1. What is the main result presented in this paper? (Hint: rephrase the title into plain English and expand a little)
    2. What is the importance or significance of this work?
    3. What is known about the problem?
    4. What is the goal of their study?
    5. For each of the figures, explain the main result that they are showing.
      • Briefly describe how the data were obtained for each of the figures.
    6. What next steps for research do the authors propose based on this work?

Online Biological Dictionaries