Falghane Week 4

From OpenWetWare
Jump to navigationJump to search


The purpose of this assignment is to analyze the microarray data using the ANOVA statistical test and p-values by using Excel. Also, the data conducted in Excel is used for practicing data analysis by using different software.


  • The assignment focuses on Δhap4 yeast stain.
    • Analysing Data from: BIOL388_S19_microarray-data_dHAP4AL
    • TimePoints: t15, t30, t60, t90, and t120.
    • Replicates: 4 replicates for t15-t-60, 3 replicates to t90 and t120.
      • t15, t30, and t60 experienced cold shock at 13°C whereas t90 and t120 experienced cold shock at 13°C followed by 30/60 min recovery at 30°C

Statistical Analysis (ANOVA)

  1. A new worksheet (dHAP4_ANOVA) was created in excel
  2. The first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for the dHAP4 strain were copied and pasted in the new worksheet.
  3. At the top of the first column to the right data, five column headers were created of the form (STRAIN)_AvgLogFC_(TIME) where (STRAIN) is strain designation and (TIME) is 15, 30, etc.
  4. The average was then calculated for all the data from row 2 associated with (STRAIN) and t15 to give the average of the log fold change data.
  5. Step 4 was repeated for t-30 to t120.
  6. In the first empty column to the right of the (STRAIN)_AvgLogFC_t120 calculation, the column header (STRAIN)_ss_HO was created.
  7. In the first cell below the header (<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2 was typed for dHAP4 and used for all the time points.
  8. The number of the data point at each time point was noted and were as follows: t15-t60: 4, t60 &t120: 3, and Total: 18
  9. The formula =SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2 was used in the first cell below the header (STRAIN)_ss_t15. Then <range of cells for logFC_t15> was replaced by data range associated with t15 and <AvgLogFC_t15> was replaced by the average for t15, and the averages of the other time points were found.
  10. In the first column to the right of (STRAIN)_ss_t120, the column header (STRAIN)_SS_full was created.
  11. The sum of the range of cells containing "ss" was found.
  12. In the next two columns to the right, the headers (STRAIN)_Fstat and (STRAIN)_p-value were created.
  13. In the first cell of the (STRAIN)_Fstat column, the formula =((n-5)/5)*(<(STRAIN)_ss_HO>-<(STRAIN)_SS_full>)/<(STRAIN)_SS_full> was used. The phrases <(STRAIN)_ss_HO> and <(STRAIN)_SS_full>, were replaced by dHAP4_ss_HO and dHAP4_SS_full. The value of replaced by number 18 too.

Sanity Check

  • clicked in A1 cell and then clicked on the Data tab. Filter icon was selected and from the drop-down arrows that appeared dHAP4_p-value was clicked.
  • A criterion that filter data so that the p-value has to be less than or equal to 0.05 was chosen.
  • Excel displayed the rows that correspond to data meeting that filtering criterion which resulted in 2294 data points being filtered out of the 6189 total.

Calculation of the Bonferroni and p-value Correction

  1. The next two columns to the right of dHAP4_p-value were labeled dHAP4_p_Bonferroni_p-value
  2. The equation =<(dHAP4)_p-value>*6189 was used Upon completion of this single computation and was used throughout the column.
  3. any corrected p-value that is greater than 1 was replaced by the number 1 by typing the following formula into the first cell below the second (dHAP4)_Bonferroni_p-value header: =IF(dHAP4_Bonferroni_p-value>1,1,dHAP4_Bonferroni_p-value), where "dHAP4_Bonferroni_p-value" refers to the cell in which the first Bonferroni p-value computation was made. And then the formula was copied throughout the column.

Calculation the Benjamini & Hochberg p-value Correction

  1. A worksheet named dHAP4_ANOVA_B-H was created
  2. the "MasterIndex", "ID", and "Standard Name" columns from the previous worksheet was copied and pasted into the first two columns of the new worksheet.
  3. unadjusted p-values from your ANOVA worksheet was copied and pasted it into Column D using Paste special > Paste values.
  4. columns A, B, C, and D were selected and sorted by ascending values on Column D by clicking the sort button from A to Z on the toolbar, and in the window that appears, it was sorted by column D, smallest to largest.
  5. The header "Rank" in cell E1 was typed. A series of numbers in ascending order from 1 to 6189 was created in this column which is the p-value rank, smallest to largest.
  6. "1" was typed into cell E2 and "2" into cell E3. then Both cells E2 and E3 were selected. and the column was filled with a series of numbers from 1 to 6189.
  7. Benjamini and Hochberg p-value correction was calculated by typing (dHAP4)_B-H_p-value in cell F1 and the following formula in cell F2: =(D2*6189)/E2. Then the equation was copied to the entire column.
  8. "dHAP4-H_p-value" was typed into cell G1.
  9. The following formula was typed into cell G2: =IF(F2>1,1,F2)
  10. columns A through G were selected and sorted by MasterIndex in Column A in ascending order.
  11. column G was copied and pasted using Paste special > Paste values into the next column on the right of your ANOVA sheet.

Sanity Check: Number of genes significantly changed

  • used to find out the number of genes that are significantly changed at various p value cut-offs.
  • Row 1 in the dHAP4_ANOVA worksheet was selected and from the menu, Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. from there the data was filtered according to set criteria.
  • The drop-down arrow for the unadjusted p-value was clicked on and criteria that set the data so that the p-value has to be less than 0.05 was chosen.
  • Bonferroni and Benjamini and Hochberg corrections were applied to the above-unadjusted p values.

Clustering and GO Term Enrichment with stem (part 2)

  • A new worksheet was inserted Excel workbook, and named it "dHAP4_stem".
  • All data from dHAP4_ANOVA worksheet were copied and pasted into the new worksheet by paste special.
  • The leftmost column was renamed from "Master_index" into "SPOT" and column B was renamed from "ID" to "Gene symbol," and the column named "Standard_name" was deleted.
  • The data in the B-H was filtered and the p-value was corrected to be > 0.05.
    • once the data was filtered, all the rows except for the header row were selected and deleted.
    • the filter was then undone to ensure that only the genes with a "significant" change in expression and not the noise get clustered.
  • Then all the data columns were deleted except for the average log columns.
  • The data columns were then renamed with just the time and time unit.
  • The work was then saved.
  • The stem software was then downloaded

Running Stem

  • The file was selected in section 1 (Expression Data ) of the main STEM interface window.
  • then the radio button "No normalization/add 0" was clicked on and the box next to "Spot ID's included in Data Files" was checked.

Viewing and Saving STEM Results

  • A new window opened called "All STEM Profiles (1)". Each box corresponds to a model expression profile. Colored profiles have a statistically significant number of genes assigned; they are arranged in order from most to least significant p value.
  • Profiles with the same color belong to the same cluster of profiles and the number in each box is simply an ID number for the profile.
  • The button that says "Interface Options..." was clicked.
  • At the bottom of the Interface Options window that appears below where it says "X-axis scale should be:", the radio button that says "Based on real time" was clicked. Then the Interface Options window was closed.
  • A screenshot was taken of this window and pasted it into a PowerPoint presentation to save the figures.
  • Then each of the SIGNIFICANT profiles (the colored ones) were clicked on to open a window showing a more detailed plot containing all of the genes in that profile.
  • a screenshot was taken of each of the individual profile windows and saved into the PowerPoint presentation.
  • There were two yellow buttons at the bottom of each profile window, "Profile Gene Table" and "Profile GO Table". For each of the profiles, the "Profile Gene Table" button was clicked on to see the list of genes belonging to the profile and in the window that appeared, the "Save Table" button was clicked on and the file was saved after the filename was made descriptive of the contents.
  • Those files were then uploaded to OpenWetWare and linked to the individual journal page.
  • For each of the significant profiles, the "Profile GO Table" was clicked on to see the list of Gene Ontology terms belonging to the profile.
  • In the window that appeared "Save Table" button was clicked on and the file was saved.
  • The filename was also made descriptive of the content.

Using YEASTRACT to Infer which Transcription Factors Regulate a Cluster of Genes

  • The gene list in Excel for one of the significant profiles from stem analysis was opened.
  • A cluster with a clear cold shock/recovery up/down or down/up pattern was chosen along with one of the largest clusters.
  • The list of gene IDs were copied onto your clipboard.
  • A web browser was launched YEASTRACT database was opened.
  • On the left panel of the window, the link to Rank by TF was clicked.
  • The list of genes from the cluster were pasted into the box labeled ORFs/Genes.
  • The box for Check for all TFs was checked.
  • then the defaults for the Regulations Filter was accepted (Documented, DNA binding plus expression evidence).
  • No filter was applied for "Filter Documented Regulations by environmental condition".
  • Genes were then ranked by TF using: The % of genes in the list and in YEASTRACT regulated by each TF.
  • Search button was clicked.
  • For the mathematical model that will be built, 15-20 transcription factors were chosen from YEASTRACT.
  • Those were chosen from the list of "significant" transcription factors.
  • GLN3, HAP4, and ZAP1 were added if they were not already in the list.
  • the list and your justification used based on being most significant:
    • ADR1
    • AFT1
    • AFT2
    • CUP2
    • GCN4
    • GIS1
    • GLN3
    • HAP1
    • HAP4
    • HOT1
    • MGA2
    • MIG1
    • MSN2
    • MSN4
    • PDR1
    • RGM1
    • SUT1
  • After that the link in YEASTRACT database to o Generate Regulation Matrix was followed.
  • The list of transcription factors above was copied and pasted into both the "Transcription factors" field and the "Target ORF/Genes" field.
  • The"Regulations Filter" options of "Documented" was used, "Only DNA binding evidence"
  • The "Generate" button was clicked.
  • In the results window that appeared, the link to the "Regulation matrix (Semicolon Separated Values (CSV) file)" that appeared was clicked and saved it to Desktop.
  • The file was renamed with a meaningful name so that it can be distinguished from the other files generated/ will be generated.

Visualizing Your Gene Regulatory Networks with GRNsight

  • used to analyze regulatory matrix and visualize to determine which one will be appropriate to pursue further in the modeling.
  • first the output files from YEASTRACT were properly formatted by Opening the file in Excel.
  • It did not open properly in Excel because a semicolon was used as the column delimiter instead of a comma. To fix this, the entire Column A was selected. then "Text to columns" was selected from the "Data" tab. In the Wizard that appeared, "Delimited" was selected and then "Next" was clicked.
  • In the next window, "Semicolon" was clicked, and then "Next". In the next window, the data format was left at "General", and then "Finish" was clicked. *It then looked like a table with the names of the transcription factors across the top and down the first column and all of the zeros and ones distributed throughout the rows and columns. This is called an "adjacency matrix." If there is a "1" in the cell, that means there is a connection between the transcription factor in that row with that column.
  • This file was saved in Microsoft Excel workbook format (.xlsx).
  • For this adjacency matrix to be usable in GRNmap (the modeling software) and GRNsight (the visualization software), the matrix transposed.
  • a new worksheet was inserted into Excel file and named "network".
  • the entire matrix was selected and copied from the previous worksheet.
  • In the new worksheet, A1 cell was clicked on the upper left.
  • "Paste special" was selected from the "Home" tab.
  • In the window that appeared, the box for "Transpose" was checked.
  • Data was pasted with the columns transposed to rows and vice versa. This is necessary because now the transcription factors that are the "regulatORS" across the top and the "regulatEES" along the side.
  • The labels for the genes in the columns and rows need to match. Thus, the "p" from each of the gene names in the columns was deleted and the case of the labels was adjusted to make them all upper case.
  • In cell A1, the text "rows genes affected/cols genes controlling" was copied and pasted.
  • Then for ease of working with the adjacency matrix in Excel, the gene labels were alphabetized both across the top and side by:
    • Selecting the area of the entire adjacency matrix.
    • Clicking the Data tab and clicking the custom sort button.
    • Sorting Column A alphabetically, being sure to exclude the header row.
    • Then sorting row 1 from left to right, excluding cell A1.
    • In the Custom Sort window, the options button was clicked and sort left to right was selected, excluding column 1.
  • The worksheet containing organized adjacency matrix was then named "network" and Saved.
  • Then what the gene regulatory networks look like was visualized using GRNsight software:
    • first GRNsight home page was opened.
    • Then the menu item File was selected, then the regulation matrix .xlsx file that has the "network" worksheet in it that was formatted above was selected. **GRNsight automatically created a graph of the network as the file was formatted properly.
    • the "Grid Layout" button was clicked to arrange the nodes in a grid and a screenshot of the results was taken and pasted into the PowerPoint presentation.
    • As floating nods appeared that are not connected to any other nods, they were deleted from the network for the modeling to work properly.
    • Both the row and column with the floating gene's name were then deleted from the Excel workbook and network sheet. Then the edited file was re-uploaded to GRNsight to visualize it and this final version was used in the PowerPoint and subsequent modeling.


Sainty Check Results

  • Genes that have p < 0.05? and the percentage out of 6189: 2479, 40%
  • Genes that have p < 0.01? and the percentage out of 6189:1583, 25.6%
  • Genes that have p < 0.001? and the percentage out of 6189: 739, 12%
  • Genes that have p < 0.0001? and the percentage out of 6189: 280, 4.5%
  • Genes that are p < 0.05 for the Bonferroni-corrected p-value? and the percentage (out of 6189): 75, 1.2%
  • Genes that are p < 0.05 for the Benjamini and Hochberg-corrected p-value? and the percentage out of 6189: 1735, 28%


  • Gln3: % in user set: 14.29, % in yeastract: 3.87. P-value: 0.8685
  • HAP4:% in user set: 26.32, % in yeastract: 8.52, P-value: 6.7*10^-10
  • Zap1:*% in user set: 38.39, % in yeastract: 8.66, P-value: 6.7*10^-13

The following transcription factors were used:

  • ADR1
  • AFT1
  • AFT2
  • CUP2
  • GCN4
  • GIS1
  • GLN3
  • HAP1
  • HAP4
  • HOT1
  • MGA2
  • MIG1
  • MSN2
  • MSN4
  • PDR1
  • RGM1
  • SUT1

STEM Results Results:

  • The profile I chose was 22
  • I chose it because this profile is one of the most significant ones.
  • Genes belong to this profile: 324
  • Genes that were expected to belong to this profile:85
  • p-value for the enrichment of genes in this profile: 9.4E-90
  • GO terms associated with this profile at p < 0.05: 177/701
  • GO terms associated with this profile with a corrected p-value < 0.05: 177/ 701

Gene Ontology Terms

  • Sexual Sporulation (GO:0034293): The formation of spores derived from the products of meiosis.  
  • Monosaccharide metabolic process (GO:0005996): The chemical reactions and pathways involving monosaccharides, the simplest carbohydrates. They are polyhydric alcohols containing either an aldehyde or a keto group and between three to ten or more carbon atoms. They form the constitutional repeating units of oligo- and polysaccharides. 
  • Actin polymerization or depolymerization (GO:000815): Assembly or disassembly of actin filaments by the addition or removal of actin monomers from a filament. 
  • Aerobic respiration (GO:0009060): The enzymatic release of energy from inorganic and organic compounds (especially carbohydrates and fats) which requires oxygen as the terminal electron acceptor.   
  • Cell Differentiation (GO:0030154): The process in which relatively unspecialized cells, e.g. embryonic or regenerative cells, acquire specialized structural and/or functional features that characterize the cells, tissues, or organs of the mature organism or some other relatively stable phase of the organism's life history. Differentiation includes the processes involved in commitment of a cell to a specific fate and its subsequent development to the mature state  
  • Peroxiredoxin activity (GO:0051920): Catalysis of the reaction: 2 R'-SH + ROOH = R'-S-S-R' + H2O + ROH.  

Scientific Conclusion

  • P-value cut-off is a moveable confidence level value since by applying Bonferroni, as well as Benjamini and Hochberg corrections to the unadjusted p-values resulted in a more specific p-value. Also, since we have genes that pass the cut-off p, <0.05 and are more than 309, we can conclude that some of the values we found are significant and are not due to chance.
  • from the stem results we could conclude that 25% of the genes had a p-value <0.05 which indicates that they are significant in term of that they had a response to the cold shock.


I texted my partner Desiree Gonzalez to help me understand the prompt better. I also met with Dr. Kam D. Dahlquist to help with this assignment.

  • Except for what is noted above, this individual journal entry was completed by me and not copied from another source.

Falghane (talk) 15:34, 18 February 2019 (PST)


  • Instructions from Week 4 were followed .
  • Excel file "BIOL388_S19_microarray-data_wt" was provided by Dr. Dahlquist's as the data was collected in her lab. "BIOL388_S19_microarray-data_wt" with the raw data.
  • Definitions from "gene_ontology.obo"