Angela C Abarquez Week 4

Electronic Lab Notebook

Purpose

Data analysis of the wild type microarray data was performed to identify significant genes and transcription factors and to present them visually.

Methods

This data analysis was performed on the wild type microarray data (see "Data and Files" section).

At time point t15: 4 replicates

At time point t30: 5 replicates

At time point t60: 4 replicates

At time point t90: 5 replicates

At time point t120: 5 replicates

Statistical Analysis Part 1: ANOVA

A new worksheet was created and named "wt_ANOVA".
The first three columns titled "MasterIndex", "ID", and "Standard Name" were copied from the "Master_Sheet" worksheet for the wild type strain (obtained from LMU Box) and pasted into this new worksheet.
The columns containing the data for the wild type strain were also copied and pasted into the new worksheet.
At the top of the first column to the right of the data, five columns were created and named "wt_AvgLogFC_t15", "wt_AvgLogFC_t30", "wt_AvgLogFC_t60", "wt_AvgLogFC_t90", "wt_AvgLogFC_t120".
In the cell below the "wt_AvgLogFC_t15" header, the equation "=AVERAGE(D2:G2)" was input.
The black plus sign at the bottom right corner of this cell was double-clicked to apply the formula to the remainder of the column.
Steps 5 and 6 were repeated for the t30, t60, t90, and t120 data using the corresponding columns of data.
In the next empty column to the right of the "wt_AvgLogFC_t120" calculation, a new header was created labeled "wt_ss_HO".
In the first cell below this column header, the equation "=SUMSQ(D2:Z2)" was input to include all the LogFC data in row 2, not including the AvgLogFC data. This formula was applied to the rest of the column.
In the next column to the right, five more column headers were created labeled "wt_ss_t15", "wt_ss_t30", "wt_ss_t60", "wt_ss_t90", and "wt_ss_t120".
In the first cell below the header "wt_ss_t15", the equation "=SUMSQ(D2:G2)-COUNTA(D2:G2)*AA2^2" was input, which uses the data range associated with t15 and the cell number where the AvgLogFC for t15 was computed. This equation was applied to the rest of the column.
The same computation was done for the other 4 time points using the applicable data. These equations were also applied to the entirety of their columns.
In the next empty column to the right, a new column titled "wt_ss_full" was created.
In the first row below this header, the equation "=SUM(AG2:AK2)" was created, which used the cells containing "ss" for each time point. This equation was applied to the entire column.
In the next two columns to the right, the headers "wt_Fstat" and "wt_p-value" were created.
Using the total number of replicates n=23, the equation "=((23-5)/5)*(AF2-AL2)/AL2" was entered in the first cell of the "wt_Fstat" column. The number 5 represents the number of timepoints. AF2 represents the cell for "wt_ss_HO" and AL2 represents the cell for "wt_ss_full". This formula was copied to the whole column.
In the first cell below the "wt_p-value" header, the equation "=FDIST(AM2,5,23-5)" was input, with AM2 corresponding to the cell in "wt_Fstat". This equation was copied to the entire column.
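The worksheet steps above can be sketched in Python for a single gene (one spreadsheet row). This is a minimal illustration under stated assumptions, not the protocol's own implementation: the function names `anova_row` and `f_right_tail` are hypothetical, and the right-tail F probability (the quantity Excel's FDIST returns) is computed here with a standard continued-fraction evaluation of the regularized incomplete beta function so that no external library is needed.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the continued fraction.
        num = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        d = 1.0 + num * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + num / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the continued fraction.
        num = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        d = 1.0 + num * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + num / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_right_tail(f_stat, df1, df2):
    """Right-tail probability of the F distribution (what FDIST reports)."""
    if f_stat <= 0.0:
        return 1.0
    return _betai(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f_stat))


def anova_row(groups):
    """groups: one list of log fold changes per time point, for one gene."""
    values = [v for g in groups for v in g]
    n, t = len(values), len(groups)          # 23 replicates, 5 time points here
    ss_h0 = sum(v * v for v in values)       # like SUMSQ over all replicates
    ss_full = sum(                           # per-time-point SS, then summed
        sum(v * v for v in g) - len(g) * (sum(g) / len(g)) ** 2
        for g in groups
    )
    f_stat = ((n - t) / t) * (ss_h0 - ss_full) / ss_full
    return f_stat, f_right_tail(f_stat, t, n - t)
```

For the wild type data, `groups` would hold five lists with 4, 5, 4, 5, and 5 values, so `n - t` is 18 and `t` is 5, matching the constants typed into the spreadsheet formulas.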

Sanity Check

The following was performed to ensure all the computations were done correctly thus far.

Cell A1 was clicked, followed by the Data tab and then the Filter icon.
After clicking on the drop-down arrow on the "wt_p-value" column header, "Number Filters" was selected. This was used to set a criterion that filtered the data so that the p-value had to be less than 0.05.
The number in the lower left hand corner, which stated the number of rows that met this criterion, was compared with that of my homework partner, Sahil, to make sure everything was correct.

Calculating the Bonferroni p-value Correction

The following adjustments were done to the p-value to correct for the multiple testing problem.

The next two open columns to the right were both labeled "wt_Bonferroni_p-value".
In the first cell under the first "wt_Bonferroni_p-value" column, the equation "=AN2*6189" was input, which uses the corresponding cell in "wt_p-value". This formula was copied to the entire column.
In the first cell under the second "wt_Bonferroni_p-value" column, the formula "=IF(AO2>1,1,AO2)" was input, with AO2 corresponding to the cell for "wt_Bonferroni_p-value". This replaced any corrected p-value that is greater than 1 with the number 1. This formula was copied to the entire column.
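The two Bonferroni columns amount to multiplying each p-value by the number of tests and capping the result at 1. A minimal sketch, assuming the p-values are held in a plain list (the function name `bonferroni` is hypothetical):

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(p * m, 1.0) for p in p_values]


# 2.87e-10 is the unadjusted NSR1 p-value reported in Results; 0.5 is a
# made-up value that shows the cap at 1.
corrected = bonferroni([2.87e-10, 0.5], n_tests=6189)
print(corrected)
```

The first value comes out near 1.78 x 10^-6, in line with the Bonferroni-corrected value recorded for NSR1.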

Calculating the Benjamini & Hochberg p-value Correction

A new worksheet named "wt_ANOVA_B-H" was created.
The "MasterIndex", "ID", and "Standard Name" columns from the previous worksheet were copied and pasted into the first three columns of the new worksheet.
Using Paste special > Paste values, the unadjusted p-values from the ANOVA worksheet were copied and pasted into Column D.
Columns A, B, C, and D were selected. The sort from A to Z button on the toolbar was clicked and used to sort by column D from smallest to largest.
Cell E1 was labeled "Rank". This column holds a series of numbers in ascending order from 1 to 6189, which ranks the p-values from smallest to largest. "1" was input into cell E2 and "2" into cell E3. Both cells were selected, and the black plus sign in the bottom right corner was double-clicked to fill the entire column.
Cell F1 was labeled "wt_B-H_p-value".
The formula "=(D2*6189)/E2" was input into cell F2 and was copied to the entire column.
Cell G1 was labeled "wt_B-H_p-value". The formula "=IF(F2>1, 1, F2)" was input into cell G2 and was copied to the entire column.
Columns A through G were selected and sorted by the MasterIndex in Column A in ascending order.
Column G was copied and pasted, using Paste special > Paste values, into the next empty column on the right of the ANOVA sheet.
The Excel file was saved, uploaded to Box, and sent to Dr. Dahlquist and Dr. Fitzpatrick via e-mail.
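The sort, rank, scale, cap, and re-sort steps above can be sketched as a single function. This follows the worksheet arithmetic as described (p * m / rank, capped at 1); the function name is hypothetical. Note that many statistical packages additionally force the adjusted values to be non-decreasing with rank, which the worksheet steps do not do, so results may differ slightly from such packages.

```python
def benjamini_hochberg_sheet(p_values):
    """B-H adjustment as done on the worksheet: p * m / rank, capped at 1."""
    m = len(p_values)
    # Sort positions by p-value; keeping the positions plays the role of
    # re-sorting by MasterIndex at the end.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(p_values[i] * m / rank, 1.0)
    return adjusted


# Made-up p-values for four genes, in their original order.
print(benjamini_hochberg_sheet([0.01, 0.04, 0.03, 0.5]))
```

The output list is in the same order as the input, so it can be placed next to the unadjusted p-values without any further sorting.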

Sanity Check

The following was performed to ensure that the data analysis was performed correctly. This helped find the number of genes that are significantly changed at various p-value cutoffs.

Row 1 of the "wt_ANOVA" worksheet was selected.
The menu item Data > Filter > Autofilter was selected, which added drop-down arrows to the top of each column. These allow filtering of the data according to specific set criteria.
The drop-down arrow for the unadjusted p-value was clicked and a criterion was set to filter the data so that the p-value had to be less than 0.05.
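The filter-and-count check can also be expressed as a small helper that reports how many p-values fall below a cutoff and what percentage of the total that is. This is only a sketch; `count_below` is a hypothetical name and the sample p-values are made up.

```python
def count_below(p_values, cutoff):
    """Return (count, percent) of p-values strictly below the cutoff."""
    count = sum(1 for p in p_values if p < cutoff)
    return count, 100.0 * count / len(p_values)


# Made-up p-values; the real column has 6189 entries.
sample = [0.001, 0.02, 0.2, 0.6]
for cutoff in (0.05, 0.01, 0.001, 0.0001):
    print(cutoff, count_below(sample, cutoff))

# The same percentage arithmetic applied to a count reported in Results.
print(round(100.0 * 2528 / 6189, 2))
```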

Clustering and GO Term Enrichment with STEM: Part 2

The microarray data file was prepared for loading into STEM by creating a new worksheet in the Excel workbook named "wt_stem".
All the data from the "wt_ANOVA" worksheet was selected and pasted into the "wt_stem" worksheet using Paste special > paste values.
The column header "MasterIndex" was renamed "SPOT", and the "ID" header in column B was renamed "Gene Symbol". The column "Standard Name" was deleted.
The data was filtered on the B-H corrected p-value to be > 0.05.
All the rows, except for the header row, were selected and deleted by right-clicking and choosing "Delete Row" from the context menu. The filter was then removed to show only the genes with a "significant" change in expression (and not the noise).
All the data columns except the Average Log Fold change columns for each timepoint were deleted.
The data columns were renamed with just the time and units (ex. 15m, 30m, etc.).
The work was saved, then Save As was used to save the spreadsheet as Text (Tab-delimited) (*.txt). The warnings were dismissed and the file was closed.
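The preparation of the STEM input file can be sketched as follows: keep only genes whose B-H corrected p-value is not above 0.05, keep the average log fold change for each time point, and write a tab-delimited text file. The row layout (dictionaries with `MasterIndex`, `ID`, `bh_p`, and `avg_logfc` keys), the function name, and the full set of time labels are assumptions made for illustration; the notebook itself only gives "15m, 30m, etc." as examples.

```python
import csv
import io

# Column labels following the "time and units" pattern (assumed for all five).
TIME_LABELS = ["15m", "30m", "60m", "90m", "120m"]


def write_stem_input(rows, out, cutoff=0.05):
    """Write SPOT, Gene Symbol, and average log fold changes for kept genes."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["SPOT", "Gene Symbol"] + TIME_LABELS)
    kept = 0
    for row in rows:
        if row["bh_p"] > cutoff:      # these rows were deleted in the worksheet
            continue
        values = [row["avg_logfc"][label] for label in TIME_LABELS]
        writer.writerow([row["MasterIndex"], row["ID"]] + values)
        kept += 1
    return kept


# Example: NSR1 values from Results (the MasterIndex of 1 is hypothetical).
rows = [{
    "MasterIndex": 1, "ID": "YGR159C", "bh_p": 8.88e-7,
    "avg_logfc": {"15m": 3.28, "30m": 3.62, "60m": 3.53,
                  "90m": -2.05, "120m": -0.61},
}]
buffer = io.StringIO()
write_stem_input(rows, buffer)
print(buffer.getvalue())
```

Passing an open file object instead of the `io.StringIO` buffer would write the same content to a .txt file.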

The STEM software (stem.zip) was downloaded to the Desktop and extracted.

The Gene Ontology and yeast GO annotations were downloaded and placed in the extracted "stem" folder.

"Stem.jar" was double-clicked to launch the STEM program.

Running STEM

In section 1, the Browse button was clicked and used to navigate and select the file.
The radio button No normalization/add 0 was clicked and the box next to Spot IDs included in the data file was checked.
In section 2, the default selections for Gene Annotation Source, Cross Reference Source, and Gene Location Source were left as "User provided".
The "Browse..." button to the right of the "Gene Annotation File" item was clicked and used to browse to the "stem" folder and select the file "gene_association.sgd.gz". Open was clicked.
In section 3, the Clustering Method was set to "STEM Clustering Method" and the defaults for Maximum Number of Model Profiles and Maximum Unit Change in Model Profiles between Time Points were not changed.
In section 4, the Execute button was clicked to run STEM.

Viewing and Saving STEM Results

The "Interface Options..." button was clicked and under "X-axis scale should be:", the radio button that says "Based on real time" was selected. The Interface Options window was then closed.
A screenshot of this window was taken and added to a PowerPoint to save all figures.
Each significant profile (the colored ones) was clicked and screenshots of the individual profile windows were taken and added to the PowerPoint presentation.
For each of the profiles, the "Profile Gene Table" button was clicked to see the list of genes belonging to the profile. In the window that appears, the "Save Table" button was clicked and the file was saved to the desktop as "wt_profile#_genelist.txt", with the number symbol corresponding to the profile number. These files can be found under the "Data and Files" section below.
For each of the significant profiles, the "Profile GO Table" was clicked. The "Save Table" button was clicked and the file was saved to the desktop as "wt_profile#_GOlist.txt", where the number symbol represents the actual profile number. These files can be found under the "Data and Files" section below.

Analyzing and Interpreting STEM Results

Profile 28 was selected for further interpretation of the data.
The GO list file for this profile was opened in Excel.
The third row was selected and Data > Filter > Autofilter was used to filter on the "p-value" column to show only GO terms that have a p-value of < 0.05.
This was then repeated on the "Corrected p-value" column to show only GO terms that have a corrected p-value of < 0.05.
Six Gene Ontology terms from the filtered list of p < 0.05 were selected and defined using http://geneontology.org.
The GO ID of each term was copied and pasted into the search field on the left of the page.
The "Link to detailed information about <term>" button in the results page was clicked. The definition was found on the next results page.

Using YEASTRACT to Infer which Transcription Factors Regulate a Cluster of Genes

The gene list for Profile 28 was opened in Excel.
The list of gene IDs was copied and pasted onto the clipboard.
Safari was used to go to the YEASTRACT database.
In the left panel of the window, the link to Rank by TF was clicked.
The list of genes from the cluster was copied and pasted into the box labeled ORFs/Genes.
The box for Check for all TFs was checked and the defaults for the Regulations Filter were accepted. No filter for "Filter Documented Regulations by environmental condition" was applied.
The genes were ranked by TF using 'the % of genes in the list and in YEASTRACT regulated by each TF.'
The Search button was clicked.
The table of results from the web page was copied and pasted into a new Excel workbook.
The Excel file was uploaded to Box and can be found in the "Data and Files" section below.
GLN3, HAP4, and ZAP1 were added to the first 13 transcription factors when sorted by p-value, for a total of 16 transcription factors.
The link to Generate Regulation Matrix on the YEASTRACT database was clicked.
The list of transcription factors chosen was copied and pasted into both the "Transcription factors" field and the "Target ORF/Genes" field.
The "Regulations Filter" options of "Documented", "Only DNA binding evidence" were applied and the "Generate" button was clicked.
The link to the "Regulation matrix (Semicolon Separated Values (CSV) file)" in the results window was clicked and saved to the desktop as "wt_profile28_YEASTRACTregulationmatrix".

Visualizing Gene Regulatory Networks with GRNsight

The output file from YEASTRACT was reformatted for GRNsight as follows. The file was opened in Excel. Column A was selected and "Text to columns" under the "Data" tab was selected. "Delimited" was selected in the Wizard that appears and "Next" was clicked. In the next window, "Semicolon" was selected and then "Next". In the next window, the data format was left at "General" and "Finish" was clicked. This file was saved as a Microsoft Excel workbook.
A new worksheet was inserted into the Excel file and named "network". The entire matrix from the previous sheet was copied and pasted into the new sheet by clicking on the A1 cell and selecting "Paste special" from the "Home" tab. The box for "Transpose" was checked.
The "p" from each of the gene names in the columns was deleted and all the labels were made upper case.
The text "rows genes affected/cols genes controlling" was typed into cell A1.
The area of the entire adjacency matrix was selected and the custom sort button under the Data tab was clicked.
Column A was sorted alphabetically, excluding the header row.
Row 1 was then sorted from left to right, excluding cell A1. In the Custom Sort window, the options button was clicked and sort left to right was selected, excluding column 1.
The worksheet was saved as "network".
On the GRNsight home page, the menu item File > Open was used to select the regulation matrix .xlsx file that has the "network" worksheet in it.
The "Grid Layout" button was clicked to arrange the nodes in a grid. A screenshot was taken and added to the PowerPoint presentation.
Genes ARR1 and SMP1 were floating in the display, so they were deleted from the network by deleting both the corresponding rows and columns in the "network" sheet in the Excel workbook. This edited file was then re-uploaded to GRNsight to visualize it. A screenshot of this visualization was also taken and added to the PowerPoint presentation.
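The reformatting of the regulation matrix (split on semicolons, transpose, remove the trailing "p", upper-case the labels, and sort both axes) can be sketched in Python. This assumes the downloaded file lists transcription factors with a "p" suffix down the first column and target genes across the first row, which is consistent with the steps above but is not confirmed here; the function name is hypothetical.

```python
def to_network_sheet(csv_text):
    """Turn a semicolon-separated matrix into a transposed, sorted grid."""

    def clean(name):
        # Drop the trailing "p" of a protein name and upper-case the label.
        return (name[:-1] if name.endswith("p") else name).upper()

    table = [line.split(";") for line in csv_text.strip().splitlines()]
    transposed = [list(col) for col in zip(*table)]   # like Paste special > Transpose
    header = [clean(name) for name in transposed[0][1:]]       # regulators
    body = [[row[0].upper()] + row[1:] for row in transposed[1:]]
    body.sort(key=lambda row: row[0])                 # sort column A alphabetically
    order = sorted(range(len(header)), key=lambda j: header[j])  # sort row 1
    grid = [["rows genes affected/cols genes controlling"]
            + [header[j] for j in order]]
    grid += [[row[0]] + [row[1 + j] for j in order] for row in body]
    return grid


# Made-up two-gene matrix in the assumed layout.
example = ";ZAP1;HAP4\nZap1p;1;0\nHap4p;1;1"
for line in to_network_sheet(example):
    print(line)
```

In the returned grid, a 1 in a given row and column means the column gene regulates the row gene, matching the label typed into cell A1.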

Results

Week 4

How many genes have p < 0.05? and what is the percentage (out of 6189)? 2528 genes (40.85%)

How many genes have p < 0.01? and what is the percentage (out of 6189)? 1652 genes (26.69%)

How many genes have p < 0.001? and what is the percentage (out of 6189)? 919 genes (14.85%)

How many genes have p < 0.0001? and what is the percentage (out of 6189)? 496 genes (8.01%)

How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)? 248 genes (4.01%)

How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)? 1822 genes (29.44%)

Comparing results with known data: the expression of the gene NSR1 (ID: YGR159C) is known to be induced by cold shock. What are its unadjusted, Bonferroni-corrected, and B-H-corrected p values?

Unadjusted: 2.87 x 10^-10

Bonferroni: 1.77 x 10^-6

B-H: 8.88 x 10^-7

What is its average Log fold change at each of the timepoints in the experiment?

T15: 3.28

T30: 3.62

T60: 3.53

T90: -2.05

T120: -0.61

P-value Table:

File:BIOL388 S19 p-value slide AA.pdf

Week 5

Why did you select this profile? In other words, why was it interesting to you?

All the genes in Profile 28 follow a similar pattern of up-regulation until the 60 minute time point, after which expression drops sharply and returns to baseline. The number of genes is also significantly higher than expected.

How many genes belong to this profile?

104 genes

How many genes were expected to belong to this profile?

20.7 genes

What is the p-value for the enrichment of genes in this profile?

1.4x10^-39

How many GO terms are associated with this profile at p<0.05?

33 out of 257

How many GO terms are associated with this profile with a corrected p-value<0.05?

4 out of 257.

The following 6 terms were defined using Gene Ontology Resource (http://geneontology.org/):

Cellular amino acid biosynthetic process: "The chemical reactions and pathways resulting in the formation of amino acids, organic acids containing one or more amino substituents."
DNA binding transcription factor activity: "A protein or a member of a complex that interacts selectively and non-covalently with a specific DNA sequence (sometimes referred to as a motif) within the regulatory region of a gene to modulate transcription. Regulatory regions include promoters (proximal and distal) and enhancers. Genes are transcriptional units, and include bacterial operons."
Adenyl ribonucleotide binding: "Interacting selectively and non-covalently with an adenyl ribonucleotide, any compound consisting of adenosine esterified with (ortho)phosphate or an oligophosphate at any hydroxyl group on the ribose moiety."
Response to stress: "Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a disturbance in organismal or cellular homeostasis, usually, but not necessarily, exogenous (e.g. temperature, humidity, ionizing radiation)."
Active transmembrane transporter activity: "Enables the transfer of a specific substance or related group of substances from one side of a membrane to the other, up the solute's concentration gradient. The transporter binds the solute and undergoes a series of conformational changes. Transport works equally well in either direction."
Ligase activity: "Catalysis of the joining of two substances, or two groups within a single molecule, with the concomitant hydrolysis of the diphosphate bond in ATP or a similar triphosphate."

How many transcription factors are green or "significant"?

2 transcription factors

Are GLN3, HAP4, and/or ZAP1 on the list? If so, what is their "% in user set", "% in YEASTRACT", and "p-value"?

Gln3p: 20.39% in user set, 1.77% in YEASTRACT, p-value: 0.125
Hap4p: 21.36% in user set, 2.20% in YEASTRACT, p-value: 0.014
Zap1p: 33.98% in user set, 2.27% in YEASTRACT, p-value: 0.001

16 Selected Transcription Factors (based on smallest p-values)

Hap4p
Gln3p
Zap1p
Gcn4p
Upc2p
Arr1p
Ume6p
Leu3p
Smp1p
Rfx1p
Sfp1p
Yrm1p
Rpn4p
Hap1p
Ecm22p
Rox1p

Conclusion

The purpose was to analyze microarray data from wild type yeast undergoing cold shock. After full analysis, six profiles were found to be significant. Profile 28 was chosen for further analysis because its genes are up-regulated until the 60 minute time point, after which expression drops sharply and returns to baseline. In addition, the number of genes in this profile (104) is significantly higher than expected (20.7). Using the corrected p-value < 0.05, only 4 GO terms passed the filter. Only two transcription factors were found to be significant. These low numbers are expected because this strain is the wild type and was not mutated in any way.

Data and Files

The following files were generated through this exercise and are accessible through the LMU Box folder for this course.

Raw Data Filename: BIOL388_S19_microarray-data_wt.zip linked here

Week 4 Workbook Filename: BIOL388_S19_microarray-data_wt_AA.xlsx linked here

Week 5 Profile Gene Lists Folder: File:Wt profile genelists.zip

Week 5 Profile Go Lists Folder: File:Wt profile GOlists.zip

Week 5 YEASTRACT Table of Results: click here

Week 4&5 Figures Powerpoint: linked here

File of all data/excel sheets used: linked here

Acknowledgments

I worked with my homework partner, Sahil, on performing the statistical analysis in Excel. We worked together in class on 2/12 and met for an hour and a half on 2/13 to compare our Excel workbooks. We also helped each other answer the questions in the protocol and create our p-value table. We also communicated over text for a few follow-up questions. We communicated again over text on 2/20 to make sure we were analyzing different profiles.

I also received help from Dr. Dahlquist and Dr. Fitzpatrick in class on 2/12 and on 2/19 when we began working on the assignment. Dr. Fitzpatrick helped by doing an example of the Excel workbook in class and helped make sure my zip files were working correctly. Dr. Dahlquist also confirmed over email that my Box links sent correctly and did an example in class on 2/19.

Except for what is noted above, this individual journal entry was completed by me and not copied from another source. Angela C Abarquez (talk) 19:28, 13 February 2019 (PST)

References

The Week 4 Assignment was used to create this journal entry. Specifically, the "Experimental Design" section was used to describe the protocol. A template for the p-value table was also obtained from this assignment.

The data was obtained from Dr. Dahlquist's lab. She also provided the Excel file "BIOL388_S19_microarray-data_wt" with the raw data.

Gene Ontology Resource. (n.d.). Retrieved February 20, 2019, from http://geneontology.org/

GRNsight. (n.d.). Retrieved February 20, 2019, from http://dondi.github.io/GRNsight/

Stem software. (n.d.). Retrieved February 20, 2019, from http://www.cs.cmu.edu/~jernst/stem/

YEASTRACT. (n.d.). Retrieved February 20, 2019, from http://www.yeastract.com