# Desireegonzalez Week 4/5


## Purpose

• The purpose of this assignment is to learn how to run statistical analyses of yeast strain data from Dr. Dahlquist's research, using Microsoft Excel, similar to those in the literature we had previously read and watched a video on.
• The purpose of the second part of this assignment was to learn how to use data analysis websites including STEM, YEASTRACT, and GRNsight to properly format data and visualize its gene networks.

## Methods

### Before Starting the Statistical Analysis

• This assignment will be analyzing the yeast strains: wild-type, dGLN3, dHAP4, and dZAP1.
• My assignment specifically will focus on the dHAP4 yeast strain.
• Filename of the master Excel sheet used: BIOL338_S19_microarray-data_dHAP4
• Master is copied on the first spreadsheet page of the Excel spreadsheet attached in the results section.
• Below are the time-points used to collect data, along with their number of replicates:
• The time-points, signified by (t#) were 15 minutes, 30 minutes, 60 minutes, 90 minutes, and 120 minutes.
• Time-points t15, t30, and t60 had four replicates, while t90 and t120 had 3 replicates.

### Statistical Analysis Part 1: ANOVA

1. A new worksheet named "dHAP4_ANOVA" was created in Excel prior to copying the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the given "Master_Sheet" worksheet of microarray data.

1. At the top of the first column to the right of this copied data, five columns were created with the labels dHAP4_AvgLogFC_(TIME), where (TIME) was replaced by the time points t15, t30, t60, t90, and t120, respectively.
2. In the cell below the dHAP4_AvgLogFC_t15 header, the formula `=AVERAGE()` was typed.
3. After clicking in between the parentheses of the formula, all the data in row 2 associated with dHAP4 and the timepoint t15 were highlighted and the enter key was pressed. This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
4. In order to copy this formula on the entire column of data, the black plus sign at the bottom of the cell was double clicked.
5. Steps 1-4 were then repeated for the other timepoint data; t30, t60, t90, and t120.
6. Next, in the first empty column to the right of the dHAP4_AvgLogFC_t120 calculation, a column labeled dHAP4_ss_HO was created.
7. In the first cell below this header, the formula `=SUMSQ()` was typed.
8. After clicking in between the parentheses of the formula, all the LogFC data in row 2 for dHAP4 (but not the AvgLogFC) were highlighted and the enter key was pressed.
9. In the next empty column to the right of dHAP4_ss_HO, columns labeled dHAP4_ss_(TIME) using all time-points as in step 3, were created.
• At this point, the number of data points for each time point of the dHAP4 strain was recorded. The total number of data points in the dHAP4 strain was also recorded.
• Total Number of Data Points in dHAP4 t15: 4
• Total Number of Data Points in dHAP4 t30: 4
• Total Number of Data Points in dHAP4 t60: 4
• Total Number of Data Points in dHAP4 t90: 3
• Total Number of Data Points in dHAP4 t120: 3
• Total Number of Data Points in dHAP4 (n): 18
10. In the first cell below the header dHAP4_ss_t15, the formula `=SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2` was typed. Within this formula, the range of cells for logFC_t15 was substituted by highlighting the second row of all dHAP4_LogFC_t15 columns (the total range of all t15 data). The <AvgLogFC_t15> term was substituted by selecting the dHAP4_AvgLogFC_t15 cell in the same row, which the formula raises to the second power. This formula was then copied onto the whole column as described in step 4.
11. This computation was then repeated for the t30 through t120 data points, making sure to substitute the various ranges of each data point, and the correct number of data points at each time-period into the formula.
12. After this, in the first column to the right of dHAP4_ss_t120, the column dHAP4_SS_full was made.
13. In the first row below this header, the formula `=sum(<range of cells containing "ss" for each timepoint>)` was typed.
14. In the next two columns to the right, the headers dHAP4_Fstat and dHAP4_p-value were created.
15. In the first cell of the dHAP4_Fstat column, the formula `=((n-5)/5)*(<(STRAIN)_ss_HO>-<(STRAIN)_SS_full>)/<(STRAIN)_SS_full>` was typed.
• The phrases <(STRAIN)_ss_HO> and <(STRAIN)_SS_full>, were replaced by the selection of the dHAP4_ss_HO and dHAP4_SS_full respectively. The value of n was also replaced by the number 18.
16. Once the formula was used to calculate the first value, it was copied down to the whole column.
17. Once done with that calculation, the first cell below the dHAP4_p-value header was filled with the formula `=FDIST(<(STRAIN)_Fstat>,5,n-5)`.
• Here, as in the step above, the phrase <(STRAIN)_Fstat> was substituted with the first value in the dHAP4_Fstat column.
• Also similar to the previous step, the "n" was replaced by the number 18, and after applying the formula it was copied down the whole column.
18. Before moving on to the next step, a quick sanity check was performed to check the correctness of the computations.
• This sanity check was done by clicking on cell A1 and then clicking on the Data tab. Here, the Filter icon (which looks like a funnel) was selected. This caused little drop-down arrows to appear at the top of each column.
• The drop-down arrow on the dHAP4_p-value column was clicked. Next, "Number Filters" was selected, and the criterion of a p-value being less than 0.05 was entered to filter the data.
• This step resulted in 2294 data points being filtered out of the 6189 total.
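The worksheet arithmetic in this section can also be mirrored outside Excel. Below is a minimal Python sketch for a single gene, assuming the replicate log fold changes are supplied as one list per time point. The function name `anova_fstat` and the sample values are hypothetical and are not taken from the dHAP4 data. The sketch reproduces only the F statistic; the p value would still come from an F distribution with 5 and n-5 degrees of freedom, as `FDIST` does in step 17.

```python
def anova_fstat(groups):
    """Return the F statistic for one gene.

    groups: list of lists, one list of replicate log fold changes per time point.
    """
    values = [x for g in groups for x in g]
    n = len(values)  # total data points (18 for dHAP4)
    k = len(groups)  # number of time points (5 in the worksheet)
    # ss_HO: sum of squares of every replicate, as SUMSQ computes it
    ss_h0 = sum(x * x for x in values)
    # SS_full: for each time point, SUMSQ minus count * average^2, then summed
    ss_full = sum(
        sum(x * x for x in g) - len(g) * (sum(g) / len(g)) ** 2
        for g in groups
    )
    return ((n - k) / k) * (ss_h0 - ss_full) / ss_full


# Hypothetical replicates laid out like the worksheet: 4, 4, 4, 3, 3 values
example = [
    [1.2, 0.8, 1.0, 1.1],
    [1.5, 1.7, 1.4, 1.6],
    [2.0, 1.8, 2.1, 1.9],
    [-0.5, -0.7, -0.6],
    [-1.0, -1.2, -0.9],
]
print(anova_fstat(example))
```

With five time points and 18 data points, the leading factor evaluates to (18-5)/5, which matches the `((n-5)/5)` term typed in step 15.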

### Calculation of the Bonferroni and p-value Correction

1. After doing the statistical analysis using the ANOVA above, adjustments must be performed to correct the calculated p values.
2. The next two columns to the right of the dHAP4_p-value column were both labeled dHAP4_Bonferroni_p-value.
3. In the cell beneath the first dHAP4_Bonferroni_p-value column, the equation `=<(STRAIN)_p-value>*6189` was typed. Notice that the <(STRAIN)_p-value> was substituted with the value found in the first cell of the dHAP4_p-value column.
4. Upon completion of this single computation, the formula was copied throughout the column.
5. Next, any p value that was greater than 1 was corrected by replacing the value with the number 1.
• This was done by typing the following formula into the first cell below the second dHAP4_Bonferroni_p-value header: `=IF(STRAIN_Bonferroni_p-value>1,1,STRAIN_Bonferroni_p-value)`, where "STRAIN_Bonferroni_p-value" refers to the cell (dHAP4_Bonferroni_p-value) in which the first Bonferroni p value computation was made. After doing the first computation, this formula was copied to the rest of the column.
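The two Bonferroni columns amount to a multiplication followed by a cap at 1. A minimal Python sketch of that logic is shown here; the function name `bonferroni` is hypothetical, and the default of 6189 tests is the gene count used in the formula above.

```python
def bonferroni(p_value, n_tests=6189):
    """Multiply the p value by the number of tests and cap the result at 1."""
    return min(p_value * n_tests, 1.0)


# NSR1 unadjusted p value reported in this entry; the product exceeds 1, so it is capped
print(bonferroni(0.016364))
# A much smaller, hypothetical p value stays below the cap
print(bonferroni(1e-6))
```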

### Calculation of the Benjamini & Hochberg p-value Correction

1. A new worksheet named "dHAP4_ANOVA_B-H" was created prior to the copying and pasting of the "MasterIndex", "ID", and "Standard Name" columns from the previous worksheet.
2. Next, the unadjusted p values from the dHAP4_ANOVA worksheet were copied and pasted into Column D.
3. Columns A, B, C, and D were then selected before Column D was sorted in ascending order using the sort button from A to Z on the toolbar.
4. The next open column (E1) was then labeled "Rank;" these numbers will help to identify the p-value rank.
5. A series of numbers (from 1 to 6189) was created in the rank column by typing "1" into cell E2 and "2" into cell E3.
6. Both cells E2 and E3 were then selected and the plus sign on the lower right-hand corner of the cell was double-clicked to allow the addition of the rest of the numbers down the column.
7. Cell F1 was then labeled "dHAP4_B-H_p-value"; this column was used to calculate the Benjamini and Hochberg p value correction.
8. To begin the correction, the following formula was typed into F2: `=(D2*6189)/E2`, prior to pressing enter and then copying the equation down to the entire column.
9. Then cell G1 was labeled "dHAP4_B-H_p-value," and the equation `=IF(F2>1,1,F2)` was typed into cell G2.
10. After pressing enter and copying the equation to the entire column, columns A through G were selected and were sorted in ascending order by the use of the MasterIndex in Column A.
11. Once sorted, Column G was copied and pasted into the next column on the right of the dHAP4_ANOVA sheet.
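The sort, rank, scale, cap, and re-sort sequence above can be sketched in Python as follows. This is a sketch of the worksheet procedure only, with a hypothetical function name `bh_adjust` and made-up p values. Statistical packages commonly add a further step that forces the adjusted values to be non-decreasing with rank; that step is not part of the worksheet and is left out here.

```python
def bh_adjust(p_values):
    """Rank-based adjustment as in the worksheet: p * m / rank, capped at 1."""
    m = len(p_values)
    # Indices sorted by ascending p value (the sort on Column D)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, idx in enumerate(order, start=1):
        # Writing back by original index plays the role of the final
        # re-sort on MasterIndex
        adjusted[idx] = min(p_values[idx] * m / rank, 1.0)
    return adjusted


# Hypothetical p values for four genes, listed in MasterIndex order
print(bh_adjust([0.01, 0.04, 0.03, 0.5]))
```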

### Running a Sanity Check: Determining the Number of Genes Significantly Changed

1. Prior to doing further analysis of the data, a sanity check must be performed to make sure that the calculations were done correctly.
2. The sanity check will also find out the number of genes that were significantly changed at various p value cut-offs.
• To begin, row 1 of the dHAP4_ANOVA worksheet was selected and then the Filter icon in the Data tab was clicked until little drop-down arrows appeared at the top of each column.
• This will enable us to filter the data according to criteria set.
• The drop-down arrow was clicked for the unadjusted p value and the criterion to determine the number of genes that had a p-value less than 0.05 was set.
• The step above was repeated to determine the genes with the criterion of p-values less than 0.01, 0.001, and 0.0001.
• 2479 genes (40%) were determined to have a p < 0.05
• 1586 genes (25%) were determined to have a p < 0.01
• 739 genes (11%) were determined to have a p < 0.001
• 280 (4.5%) genes were determined to have a p < 0.0001
3. To apply a more stringent criterion to our p values, the Bonferroni and Benjamini and Hochberg corrections were performed on the unadjusted p values.
• 75 genes (1.2%) have a p < 0.05 for the Bonferroni-corrected p value
• 1735 genes (28%) have a p < 0.05 for the Benjamini and Hochberg-corrected p value
4. "NSR1" (ID: YGR159C) is known to be induced by cold shock and can be used in data comparison.
• Its unadjusted p value is 0.016364
• Its Bonferroni-corrected p value is 101.278
• Its B-H-corrected p value is 0.055525
• Its average log fold change values at each of the time-points in the experiment are as follows:
• t15= 2.69945
• t30= 3.2508
• t60= 3.519975
• t90= -1.100566667
• t120= -1.797666667
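The filter-and-count steps used for the cut-offs above can be expressed as a small helper. The following Python sketch assumes the p values for all genes are available as a list; the function name `count_below` and the sample values are hypothetical.

```python
def count_below(p_values, cutoff):
    """Return (count, percent) of p values strictly below the cutoff."""
    count = sum(1 for p in p_values if p < cutoff)
    return count, 100.0 * count / len(p_values)


# Hypothetical p values standing in for a worksheet column
hypothetical = [0.00005, 0.0004, 0.003, 0.02, 0.2, 0.6]
for cutoff in (0.05, 0.01, 0.001, 0.0001):
    print(cutoff, count_below(hypothetical, cutoff))
```

Running the same helper on the unadjusted, Bonferroni, and B-H columns would give the counts and percentages that were read off the Excel filter.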

### Clustering and STEM Software

#### Setting Up the Data for STEM

• Prior to using the STEM software, a new worksheet named "dHAP4_stem" needed to be added to the Excel workbook.
• In this new worksheet, all the data from the "dHAP4_ANOVA" worksheet was copied and pasted (making sure to only copy in the values).
• Next, the "Master_Index" column needed to be renamed to "SPOT". The column named "ID" also needed to be renamed to "Gene Symbol". The "Standard_Name" column was deleted.
• After renaming and deleting the columns, the data on the B-H corrected p value was filtered to be greater than 0.05.
• Once the data was filtered, all the rows (except for the header row) were deleted before undoing the filter.
• Data in all columns except for the Average Log Fold change columns was then deleted.
• The Log Fold change columns were then renamed with just the time and units (15m, 30m, 60m, 90m, 120m).
• This spreadsheet was then saved as a Text (Tab-delimited) (*.txt) file.
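The formatting steps above produce a tab-delimited table with a "SPOT" column, a "Gene Symbol" column, and one column per time point, keeping only genes whose B-H corrected p value is not greater than 0.05. A minimal Python sketch of the same layout is given below, using only the standard `csv` module. The dictionary keys (`spot`, `gene`, `bh_p`, `avg_logfc`) and the gene names are hypothetical and chosen for illustration.

```python
import csv
import io


def write_stem_table(rows, out, cutoff=0.05):
    """Write rows that pass the B-H cutoff as a tab-delimited table."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["SPOT", "Gene Symbol", "15m", "30m", "60m", "90m", "120m"])
    kept = 0
    for row in rows:
        # Rows with a B-H p value greater than the cutoff are dropped
        if row["bh_p"] > cutoff:
            continue
        writer.writerow([row["spot"], row["gene"], *row["avg_logfc"]])
        kept += 1
    return kept


# Hypothetical genes; only GENE_A passes the cutoff
rows = [
    {"spot": 1, "gene": "GENE_A", "bh_p": 0.01,
     "avg_logfc": [0.5, 1.0, 1.5, -0.5, -1.0]},
    {"spot": 2, "gene": "GENE_B", "bh_p": 0.20,
     "avg_logfc": [0.1, 0.0, -0.1, 0.2, 0.0]},
]
buffer = io.StringIO()
kept = write_stem_table(rows, buffer)
print(kept)
print(buffer.getvalue())
```

Writing to an open file instead of `io.StringIO` would give a `.txt` file in the same layout as the one saved from Excel.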

#### Running STEM

• The `stem.zip` file was downloaded to the Desktop using the download link.
• After this, the file was unzipped, and the Gene Ontology and yeast GO annotation files were downloaded and placed in the open folder `stem`.
• The `stem.jar` file was then double-clicked to launch the STEM program.
2. Once opened, in section 1 (Expression Data Info) of the main STEM interface window, the "Browse" button was clicked to upload the text file that had been properly formatted above.
• The "No normalization/add 0" radio button was chosen.
• The box next to Spot IDs included in the data file was also clicked.
3. In section 2 (Gene Info) of the main STEM interface window, all default selections were left.
4. The "Browse" button to the right of the "Gene Annotation File" item was clicked and the file "gene_association.sgd.gz" (from the stem folder) was uploaded.
5. In section 3 (Options) of the main STEM interface window, the Clustering Method "STEM Clustering Method" was chosen; all other defaults were left the same.
6. In section 4 (Execute), the yellow "Execute" button was clicked to run the STEM.

#### Viewing and Saving STEM Results

1. When the STEM finished running, a new window called "All STEM Profiles (1)" opened up.
• The colored profiles on this window have a statistically significant number of genes assigned.
2. The "Interface Options" button was clicked and the X-axis scale was changed to "Based on real time". Then the Interface Options window was closed.
• Screenshots of this "All STEM Profiles" window and each of the significant profiles were taken and pasted into a PowerPoint presentation as a way to save all the figures.
• At the bottom of each profile window, the "Profile Gene Table" button was clicked to see the list of genes belonging to the profile. Each table was saved onto the desktop by pressing the "Save Table" button. The files were named "dHAP4_profile#_genelist.txt", where the # was replaced with the actual profile number.
• Similarly, for each of the significant profiles, the "Profile GO Table" button was clicked to see the list of Gene Ontology terms belonging to the profile. These tables were also saved onto the desktop by clicking the "Save Table" button. The files were named "dHAP4_profile#_GOlist.txt", where the # was replaced by the actual profile number.

#### Analyzing and Interpreting STEM Results

1. When all the data was obtained, I chose dHAP4_Profile9 for further data interpretation.
2. The GO list file for Profile 9 was then opened using Excel.
3. The third row of this spreadsheet was filtered to show only GO terms that had a p-value of less than 0.05.
4. The column called "Corrected p-value" was then also filtered to show only GO terms that have a corrected p value of less than 0.05.
5. Six Gene Ontology terms from the filtered list (using the non corrected p-value less than 0.05) were then chosen and defined using the geneontology website.
• The definitions for each of the terms were looked up at http://geneontology.org.
• The GO ID (e.g. GO:0044848) was copied and pasted into the search field on the left of the page.
• On the results page, the button that says "Link to detailed information about <term>" was clicked; this resulted in another page opening and giving the definition of the given term.

### Using YEASTRACT To Infer Which Transcription Factors Regulate a Cluster of Genes

1. The gene list of the significant profile (dHAP4_Profile9) was opened in an Excel spreadsheet.
2. The list of gene IDs were copied onto the clipboard.
3. Next, the web browser was launched to go to the YEASTRACT database homepage.
4. On the left hand side of this website, the tab "Rank by TF" was clicked.
5. Here, the list of genes from the Profile 9 cluster were pasted into the box labeled ORFs/Genes.
6. The box for Check for all TFs was then clicked and all other defaults were left as is.
7. Next the genes were ranked by TF using: The % of genes in the list and in YEASTRACT regulated by each TF. Then, the Search button was clicked.
8. From the list of "significant" transcription factors returned by the YEASTRACT website, the fifteen significant transcription factors, with the addition of GLN3, HAP4, and ZAP1, were chosen to run the model.
• Below is the list of Transcription Factors that were chosen to be run in YEASTRACT:
• Sum1p
• Sfp1p
• Sut2p
• YGR067C
• Hsf1p
• Ert1p
• Aft2p
• Tod6p
• Fhl1p
• Mbp1p
• Gcn4p
• Pho2p
• Sut1p
• Gal3p
• Ifh1p
• GLN3p
• ZAP1p
• HAP4p
9. On the YEASTRACT website, a Regulation Matrix was generated by following the "Generate Regulation Matrix" link and copying and pasting the list of the transcription factors (listed above) into both the "Transcription factors" field and the "Target ORF/Genes" field.
10. Before clicking "Generate," the regulations were filtered by clicking "Only DNA binding evidence".
11. The results that appeared were saved onto the desktop as a "Regulation matrix (Semicolon Separated Values (CSV) file)".

### Using GRNsight to Visualize The Gene Regulatory Networks

#### Preparing the YEASTRACT File for GRNsight

1. To be able to visualize the gene networks, the output files from YEASTRACT needed to be properly formatted.
2. The CSV file was opened in Excel.
• To see the data properly I had to go to the "Data" tab and select "Text to columns".
• In the Wizard that appears, "Delimited" was selected before clicking "Next," then "Semicolon" was selected before clicking "Next" and finally clicking "Finish" on the last page.
3. This modified file was then saved in the Microsoft Excel workbook format (.xlsx).
4. In order to be able to use this adjacency matrix in GRNmap (the modeling software) and GRNsight (the visualization software), the matrix needed to be transposed.
• A new worksheet labeled "Original Network" was inserted in the Excel file. On this new worksheet, the entire matrix from the previous sheet was selected, copied, and then pasted into the new sheet.
• When pasting, "Paste special" was used by clicking the box for "Transpose".
• The labels for the genes in the columns and rows were then adjusted to delete the "p" and make everything upper case. Also, cell A1 was renamed to "row genes affected/cols genes controlling".
• Finally, the gene labels were alphabetized across the top and the side.
• The entire data matrix was selected before pressing the custom sort button within the "Data" Tab.
• Column A was sorted alphabetically, excluding the header row.
• Next, row 1 was sorted from left to right, excluding column 1.
• The organized adjacency matrix was titled "Original Network" and was saved.
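The transpose, relabel, and alphabetize steps can be sketched in Python as shown here. The sketch assumes the regulation matrix has already been split on semicolons into a list of rows, with labels in the first row and first column. The function names `clean_label` and `transpose_and_sort` and the tiny sample matrix are hypothetical.

```python
def clean_label(name):
    """Drop a trailing lowercase 'p' (e.g. Gcn4p) and upper-case the label."""
    name = name.strip()
    if name.endswith("p"):
        name = name[:-1]
    return name.upper()


def transpose_and_sort(matrix):
    """Transpose a labeled matrix and sort both label sets alphabetically."""
    header, body = matrix[0], matrix[1:]
    col_names = [clean_label(c) for c in header[1:]]
    row_names = [clean_label(r[0]) for r in body]
    # Look up each value by its original (row label, column label) pair
    lookup = {
        (r, c): body[i][j + 1]
        for i, r in enumerate(row_names)
        for j, c in enumerate(col_names)
    }
    new_rows = sorted(col_names)  # old column labels now label the rows
    new_cols = sorted(row_names)  # old row labels now label the columns
    out = [["row genes affected/cols genes controlling"] + new_cols]
    for r in new_rows:
        out.append([r] + [lookup[(c, r)] for c in new_cols])
    return out


# Hypothetical two-gene matrix: the Gcn4p row has a 1 in the Aft2p column
sample = [
    ["", "Aft2p", "Gcn4p"],
    ["Gcn4p", "1", "0"],
    ["Aft2p", "0", "0"],
]
for row in transpose_and_sort(sample):
    print(row)
```

After the transpose, the 1 from the sample sits in the AFT2 row under the GCN4 column, which agrees with the "row genes affected/cols genes controlling" label placed in cell A1.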

#### Visualization Using GRNsight

2. Here, the Regulation Matrix Excel File that had the "Original Network" worksheet was selected.
• I copied the "Original Network" data and then deleted both the row and column with the floating gene's names.
3. The updated file was then re-uploaded to GRNsight to visualize it.
• The "Grid Layout" button was clicked to arrange the nodes in a grid.
• Then it was screenshot and pasted into a PowerPoint presentation.
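Deleting the row and column of a floating gene can also be done programmatically. The Python sketch below assumes that a "floating" gene is one whose row and column in the adjacency matrix contain only zeros, meaning it neither regulates nor is regulated by any other gene in the network. That definition, the function name `drop_floating`, and the sample matrix are assumptions made for illustration.

```python
def drop_floating(matrix):
    """Remove genes whose row and column are entirely zero.

    matrix: list of rows; the first row and first column hold gene labels.
    """
    header = matrix[0]
    body = matrix[1:]

    def connected(gene):
        # Nonzero entry in the gene's own row
        in_row = any(
            float(v) != 0 for r in body if r[0] == gene for v in r[1:]
        )
        # Nonzero entry in the gene's own column
        in_col = False
        if gene in header[1:]:
            j = header.index(gene, 1)
            in_col = any(float(r[j]) != 0 for r in body)
        return in_row or in_col

    genes = set(header[1:]) | {r[0] for r in body}
    keep = {g for g in genes if connected(g)}
    keep_idx = [0] + [j for j in range(1, len(header)) if header[j] in keep]
    out = [[header[j] for j in keep_idx]]
    for r in body:
        if r[0] in keep:
            out.append([r[j] for j in keep_idx])
    return out


# Hypothetical network: C has no connections and is removed
network = [
    ["row genes affected/cols genes controlling", "A", "B", "C"],
    ["A", 0, 1, 0],
    ["B", 0, 0, 0],
    ["C", 0, 0, 0],
]
for row in drop_floating(network):
    print(row)
```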

## Results

1. In this assignment I was studying the dHAP4 yeast strain.
2. The time-points used in this experiment were 15 minutes, 30 minutes, 60 minutes, 90 minutes, and 120 minutes. The time-points t15, t30, and t60 had four replicates, while t90 and t120 had 3 replicates.
• Total Number of Data Points in dHAP4 t15: 4
• Total Number of Data Points in dHAP4 t30: 4
• Total Number of Data Points in dHAP4 t60: 4
• Total Number of Data Points in dHAP4 t90: 3
• Total Number of Data Points in dHAP4 t120: 3
• Total Number of Data Points in dHAP4 (n): 18
3. Number of Genes found using various p-values and methods:
• 2479 genes (40%) were determined to have a p < 0.05
• 1586 genes (25%) were determined to have a p < 0.01
• 739 genes (11%) were determined to have a p < 0.001
• 280 (4.5%) genes were determined to have a p < 0.0001
• Bonferroni-corrected p-value:
• 75 genes (1.2%) have a p < 0.05
• Benjamini and Hochberg-corrected p-value
• 1735 genes (28%) have a p < 0.05
4. NSR1 p-values:
• Bonferroni-corrected= 101.278
• B-H-corrected=0.055525
5. NSR1 Average Log fold change for each of the experimental timepoints:
• t15= 2.69945
• t30= 3.2508
• t60= 3.519975
• t90= -1.100566667
• t120= -1.797666667

### Table of Results

Table 1: The number and percentage of dHAP4 genes found using unadjusted p-values and p-values corrected by the Bonferroni and Benjamini & Hochberg methods.

### Selection of dHAP4_Profile9

1. Why did you select this profile? In other words, why was it interesting to you?
• This profile was interesting to me because it had a very clear and large downward expression slope before slowly rising back upward. I also chose Profile9 since it had the most lines beneath the 0 X-axis from 15m to 90m in comparison to the other significant profiles.
2. How many genes belong to this profile?
• 289.0
3. How many genes were expected to belong to this profile?
• 56.1
4. What is the p value for the enrichment of genes in this profile?
• 1.3x10^-114
5. How many GO terms are associated with this profile at p < 0.05?
• 25/88 were found to have a p < 0.05.
6. How many GO terms are associated with this profile with a corrected p value < 0.05?
• None (0/88) were found to have a corrected p value < 0.05.

### Gene Ontology Definitions from STEM Results

1. Endoplasmic reticulum membrane: The lipid bilayer surrounding the endoplasmic reticulum (Gene Ontology Resource 2019).
2. Apoptotic process: A programmed cell death process which begins when a cell receives an internal (e.g. DNA damage) or external signal (e.g. an extracellular death ligand), and proceeds through a series of biochemical events (signaling pathway phase) which trigger an execution phase. The execution phase is the last step of an apoptotic process, and is typically characterized by rounding-up of the cell, retraction of pseudopodes, reduction of cellular volume (pyknosis), chromatin condensation, nuclear fragmentation (karyorrhexis), plasma membrane blebbing and fragmentation of the cell into apoptotic bodies. When the execution phase is completed, the cell has died (Gene Ontology Resource 2019).
3. Protein folding: The process of assisting in the covalent and noncovalent assembly of single chain polypeptides or multisubunit complexes into the correct tertiary structure (Gene Ontology Resource 2019).
4. Cytoplasmic translation: The chemical reactions and pathways resulting in the formation of a protein in the cytoplasm. This is a ribosome-mediated process in which the information in messenger RNA (mRNA) is used to specify the sequence of amino acids in the protein (Gene Ontology Resource 2019).
5. Endoplasmic reticulum to Golgi vesicle-mediated transport: The directed movement of substances from the endoplasmic reticulum (ER) to the Golgi, mediated by COP II vesicles. Small COP II coated vesicles form from the ER and then fuse directly with the cis-Golgi (Gene Ontology Resource 2019). Larger structures are transported along microtubules to the cis-Golgi.
6. Fungal-type vacuole membrane: The lipid bilayer surrounding a vacuole, the shape of which correlates with cell cycle phase. The membrane separates its contents from the cytoplasm of the cell. An example of this structure is found in Saccharomyces cerevisiae (Gene Ontology Resource 2019).

### YEASTRACT Tables and Questions

1. How many transcription factors are green or "significant"?
• 15
2. Are GLN3, HAP4, and/or ZAP1 on the list? If so, what is their "% in user set", "% in YEASTRACT", and "p value".
• Yes, GLN3, HAP4, and ZAP1 are on the list.
• GLN3:% in user set was 20.42%; % in YEASTRACT was 4.97%; p value was 0.036104013.
• HAP4:% in user set was 19.72%; % in YEASTRACT was 5.71%; p value was 0.002321739657752.
• ZAP1:% in user set was 24.91%; % in YEASTRACT was 4.68%; p value was 0.071119199964520.

## Scientific Conclusion

• After working on this assignment, I do believe that the purpose of understanding how to run statistical data analysis on yeast strain data was fulfilled. I have gained a greater understanding of how to determine the p-values of a set of data using an ANOVA, Bonferroni p-value Correction, and Benjamini & Hochberg p-value Correction in Microsoft Excel. The major finding that resulted from this lab was that a larger number of genes are identified with p-values less than 0.05 when using unadjusted data. The smaller numbers of genes obtained with the Bonferroni p-value Correction and Benjamini & Hochberg p-value Correction showed how scientists can adjust p-values to make the significance criterion more stringent.
• After working on the second half of this assignment, I do believe that I have gained an understanding of how to use the data analysis tools STEM, YEASTRACT, and GRNsight to properly format data in order to make models (like gene networks) that make it easier to visualize the relationships between various data points (specifically dHAP4 genes). The STEM dHAP4_Profile9 data and its further analysis with the help of YEASTRACT and GRNsight revealed that some of the genes were significant, and that a large majority of the genes were interconnected in their networking. In reference to the use of DNA microarrays initially to collect the data, this interconnection of networks could have appeared as a larger number of yellow spots due to the similarities in gene pathways.

## Acknowledgements

• I communicated with my homework partners, Ava and Brianna, through text message and face-to-face in the computer lab to compare numbers obtained from our statistical analysis. I also emailed Dr. Dahlquist and Dr. Fitzpatrick for assistance when I was experiencing issues using GRNsight.
• In addition to communicating with the professors, I copied and edited some of the wikitext for the methods sections of this assignment from the BIOL388/S19: Week 4 page written by Dr. Dahlquist and Dr. Fitzpatrick.

Except for what is noted above, this individual journal entry was completed by me and not copied from another source. Desireegonzalez (talk) 22:30, 20 February 2019 (PST)

## References

Dahlquist, K. & Fitzpatrick, B.G. (2019). "BIOL388/S19: Week 4." Retrieved from https://openwetware.org/wiki/BIOL388/S19:Week_4 on 13 February 2019.

Gene Ontology Resource (2019, February 2). Apoptotic Process. Retrieved from http://amigo.geneontology.org/amigo/term/GO:0006915 on 19 February 2019.

Gene Ontology Resource (2019, February 2). Cytoplasmic Translation. Retrieved from http://amigo.geneontology.org/amigo/term/GO:0002181 on 19 February 2019.

Gene Ontology Resource (2019, February 2). Endoplasmic Reticulum Membrane. Retrieved from http://amigo.geneontology.org/amigo/term/GO:0005789 on 19 February 2019.

Gene Ontology Resource (2019, February 2). Endoplasmic Reticulum To Golgi Vesicle-Mediated Transport. Retrieved from http://amigo.geneontology.org/amigo/term/GO:0006888 on 19 February 2019.

Gene Ontology Resource (2019, February 2). Fungal-Type Vacuole Membrane. Retrieved from http://amigo.geneontology.org/amigo/term/GO:0000329 on 19 February 2019.

Gene Ontology Resource (2019, February 2). Protein Folding. Retrieved from http://amigo.geneontology.org/amigo/term/GO:0006457 on 19 February 2019.
