Tessa A. Morris Electronic Lab Notebook 2015-2016

From OpenWetWare

22 September 2015

Normalization

Github issue 127

Installing R 3.1.0 and the limma package

  • Download version 3.1.0 of R, released in April 2014 (link to download site), and version 3.20.1 of the limma package (direct link to download zipped file) on the Windows 7 platform.
  • To use the limma package, unzip the file and place the contents into a folder called "limma" in the library directory of the R program.
  • There was an error when trying to find limma. A solution was to install limma on R, open R and type the following, which was taken from Section 2.2

Running the Normalization Script

  • Create a folder on your Desktop to store your files for the microarray analysis procedure (Microarray_Analysis_TM-2015-09-22).
  • Download the zipped file that contains the .gpr files and save it to this folder (or move it if it was saved in a different folder).
    • Unzip this file using 7-zip. Right-click on the file and select the menu item, "7-zip > Extract Here".
  • Download the GCAT_Targets.csv and Ontario_Targets_wt-dCIN5-dGLN3-dHAP4-dHMO1-dSWI4-dZAP1-Spar_20150514.csv files and save them to this folder (or move them if they were saved to a different folder).
  • Download the bigFatNormalization2.R script and save (or move) it to this folder.
  • Download the generatePlots1.R script and save (or move) it to this folder.
  • Launch R x64 3.1.0 (make sure you are using the 64-bit version).
  • Change the directory to the folder containing the targets file and the GPR files for the chips by selecting the menu item File > Change dir... and clicking on the appropriate directory. You will need to click on the + sign to drill down to the right directory. Once you have selected it, click OK.
  • In R, select the menu item File > Source R code..., and select the bigFatNormalization2.R script.
    • Wait while R processes your files.
  • When the processing has finished, you will find three files called GCAT_and_Ontario_Unnormalized.csv, GCAT_and_Ontario_Within_Array_Normalization.csv, and GCAT_and_Ontario_Between_Array_Normalization.csv. The last file is the one that you will need going forward.
    • Save these files to LionShare.

Visualizing the Normalized Data

  • Immediately after running the normalization script, select the menu item File > Source R code..., and select the generatePlots1.R script.
    • Wait while R processes your files. You will see the individual plots being created in a new window. R will save them automatically to the same folder that contains the data and scripts.
      • The box plots for each strain (comparison of the before, after within- and after between-chip normalization in the same file) are saved under the name of the strain, e.g., dCIN5.jpg.
      • The MA plots are saved under a name for the individual chip, e.g., dCIN5_LogFC_t15-1.jpg, and show the plots both before and after normalization.
  • Zip the files of the plots together and upload to LionShare and/or save to a flash drive.

Statistical Analysis

  • For the statistical analysis, we will begin with the file "GCAT_and_Ontario_Between_Array_Normalization.csv" that you generated in the previous step.
  • Open this file in Excel and Save As an Excel Workbook *.xlsx. It is a good idea to add your initials and the date (yyyymmdd) to the filename as well.
  • Rename the worksheet with the data "Compiled_Normalized_Data".
    • Type the header "ID" in cell A1.
    • Insert a new column after column A and name it "Standard Name". Column B will contain the common names for the genes on the microarray.
      • Copy the entire column of IDs from Column A.
      • Paste the names into the "Value" field of the ORF List <-> Gene List tool in YEASTRACT. Then, click on the "Transform" button.
      • Select all of the names in the "Gene Name" column of the resulting table.
      • Copy and paste these names into column B of the *.xlsx file. Save your work.
    • Insert a new column on the very left and name it "MasterIndex". We will create a numerical index of genes so that we can always sort them back into the same order.
      • Type a "1" in cell A2 and a "2" in cell A3.
      • Select both cells. Hover your mouse over the bottom-right corner of the selection until it makes a thin black + sign. Double-click on the + sign to fill the entire column with a series of numbers from 1 to 6189 (the number of genes on the microarray).
  • Insert a new worksheet and call it "Rounded_Normalized_Data". We are going to round the normalization results to four decimal places because of slight variations seen in different runs of the normalization script.
    • Copy the first three columns of the "Compiled_Normalized_Data" sheet and paste it into the first three columns of the "Rounded_Normalized_Data" sheet.
    • Copy the first row of the "Compiled_Normalized_Data" sheet and paste it into the first row of the "Rounded_Normalized_Data" sheet.
    • In cell D2, type the equation =ROUND(Compiled_Normalized_Data!D2,4).
    • Copy and paste this equation in the rest of the cells of row 2.
    • Select all of the cells of row 2 and hover your mouse over the bottom right corner of the selection. When the cursor changes to a thin black "plus" sign, double-click on it to paste the equation to all the rows in the worksheet. Save your work.
  • Insert a new worksheet and call it "Master_Sheet".
    • Go back to the "Rounded_Normalized_Data" sheet and Select All and Copy.
    • Click on cell A1 of the "Master_Sheet" worksheet. Select Paste special > Paste values to paste the values, but not the formulas from the previous sheet. Save your work.
    • There will be some #VALUE! errors in cells where there was missing data for genes that existed on the Ontario chips, but not the GCAT chips.
      • Select the menu item Find/Replace and Find all cells with "#VALUE!" and replace them with a single space character. Record in your electronic lab notebook how many replacements were made. Save your work.
      • There were 41998 replacements.
  • This will be the starting point for our statistical analysis below.
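
The spreadsheet preparation above (numbering the genes, rounding to four decimal places, and blanking out missing values) can be sketched in Python. This is only an illustration of the Excel steps, not part of the lab's pipeline: the function name `build_master_rows`, the sample values, and the placeholder gene `ORF_B` are hypothetical, and missing values are assumed to arrive as `None`, an empty string, or "NA".

```python
def build_master_rows(rows, decimals=4):
    """Mimic the Master_Sheet steps for a list of genes.

    `rows` is a list of (gene_id, standard_name, values). Each value is a
    number (or numeric string) or a missing marker: None, "" or "NA"
    (an assumed convention). Returns the new rows and the number of
    missing values that were replaced with a single space.
    """
    master = []
    replacements = 0
    for index, (gene_id, name, values) in enumerate(rows, start=1):
        cleaned = []
        for value in values:
            if value in (None, "", "NA"):
                cleaned.append(" ")  # same effect as replacing #VALUE! with a space
                replacements += 1
            else:
                cleaned.append(round(float(value), decimals))
        # The 1-based MasterIndex goes in the first column.
        master.append([index, gene_id, name] + cleaned)
    return master, replacements


# Made-up values for illustration; ORF_B / GENE_B are placeholders.
sample = [
    ("YGR159C", "NSR1", [0.123456, "NA", -1.5]),
    ("ORF_B", "GENE_B", ["0.5", 2.00004, None]),
]
master, replaced = build_master_rows(sample)
```

Note that Python's `round` and Excel's ROUND can disagree on exact ties, so the fourth decimal place may occasionally differ from the spreadsheet.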

Assignments for Normalization

Github issue 137

  • Kevin & Monica: Δswi4, Δhap4
  • Tessa: Δgln3, Δcin5, Δhap4
  • Kristen: Δzap1, Δgln3, Δswi4
  • Kayla: Δzap1, Δcin5, wt
  • Natalie: spar, wt, Δhmo1
  • Grace: Δhmo1, wt, spar

Converting Test_files into .xlsx

Github issue 126

  1. Download "Git" from this link
  2. Go through the installation process (no changes needed)
  3. Once installed open "Git Bash" from the programs menu on the computer
  4. To have the GRNmap folder to edit on the desktop type cd Desktop then press enter
  5. To clone into GRNmap type git clone https://github.com/kdahlquist/GRNmap.git then press enter
  6. To edit GRNmap, type cd GRNmap then press enter
  7. To edit the Beta version, type git checkout beta
  8. There will now be a file on the desktop titled "GRNmap"
  9. Open this folder and navigate to the "test_files" folder.
    • Type git checkout beta then press enter
    • Type git status then press enter; this will show the changes that you made in red
    • Type git add . then press enter, which will stage all changed files and folders.
    • Type git status then press enter
    • Type git commit -m "convert all test files from xls format to xlsx" (description of the change that was made) then press enter
    • To configure the computer with your user name, type git config --global user.name "github username" then press enter, and then type git config --global user.email "email used for github" then press enter
    • To update anything that was changed while you were working type git pull then press enter
    • To update the website with your changes type git push then press enter and enter your username and password when prompted.
  10. Comment on issue 126 on GitHub explaining what was done, change the label to "review requested", and select Dr. Dahlquist.

28 September 2015

Within-Strain ANOVA

Github Issue 137

  1. Create a new worksheet, naming it "(STRAIN)_ANOVA" as appropriate. For example, you might call yours "wt_ANOVA" or "dHAP4_ANOVA".
  2. Copy the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for your strain and paste it into your new worksheet. Copy the columns containing the data for your strain and paste it into your new worksheet.
  3. At the top of the first column to the right of your data, create five column headers of the form (STRAIN)_AvgLogFC_(TIME) where (STRAIN) is your strain designation and (TIME) is 15, 30, etc.
  4. In the cell below the (STRAIN)_AvgLogFC_t15 header, type =AVERAGE(
  5. Then highlight all the data in row 2 associated with (STRAIN) and t15, press the closing paren key (shift 0), and press the "enter" key.
  6. This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
  7. Click on this cell and position your cursor at the bottom right corner. You should see your cursor change to a thin black plus sign (not a chubby white one). When it does, double click, and the formula will magically be copied to the entire column of 6188 other genes.
  8. Repeat steps (4) through (7) with the t30, t60, t90, and the t120 data.
  9. Now in the first empty column to the right of the (STRAIN)_AvgLogFC_t120 calculation, create the column header (STRAIN)_ss_HO.
  10. In the first cell below this header, type =SUMSQ(
  11. Highlight all the LogFC data in row 2 for your (STRAIN) (but not the AvgLogFC), press the closing paren key (shift 0), and press the "enter" key.
  12. In the next empty column to the right of (STRAIN)_ss_HO, create the column headers (STRAIN)_ss_(TIME) as in (3).
  13. Make a note of how many data points you have at each time point for your strain. For most of the strains, it will be 4, but for dHAP4 t90 or t120, it will be "3", and for the wild type it will be "4" or "5". Count carefully. Also, make a note of the total number of data points. Again, for most strains, this will be 20, but for dHAP4, for example, this number will be 18, and for wt it should be 23 (double-check).
    • CIN5 - 20 replicates
      • 15: 4; 30: 4; 60: 4; 90: 4; 120: 4
    • GLN3 - 20 replicates
      • 15: 4; 30: 4; 60: 4; 90: 4; 120: 4
    • HAP4 - 18 replicates
      • 15: 4; 30: 4; 60: 4; 90: 3; 120: 3
  14. In the first cell below the header (STRAIN)_ss_t15, type =SUMSQ(<range of cells for logFC_t15>)-<number of data points>*<AvgLogFC_t15>^2 and hit enter.
    • The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
    • The phrase <number of data points> should be replaced by the number of data points for that timepoint (either 3, 4, or 5).
    • The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
    • Upon completion of this single computation, use the Step (7) trick to copy the formula throughout the column.
  15. Repeat this computation for the t30 through t120 data points. Again, be sure to get the data for each time point, type the right number of data points, and get the average from the appropriate cell for each time point, and copy the formula to the whole column for each computation.
  16. In the first column to the right of (STRAIN)_ss_t120, create the column header (STRAIN)_SS_full.
  17. In the first row below this header, type =sum(<range of cells containing "ss" for each timepoint>) and hit enter.
  18. In the next two columns to the right, create the headers (STRAIN)_Fstat and (STRAIN)_p-value.
  19. Recall the number of data points from (13): call that total n.
  20. In the first cell of the (STRAIN)_Fstat column, type =((n-5)/5)*(<(STRAIN)_ss_HO>-<(STRAIN)_SS_full>)/<(STRAIN)_SS_full> and hit enter.
    • Don't actually type the n but instead use the number from (13). Also note that "5" is the number of timepoints and the dSWI4 strain has 4 timepoints (it is missing t15).
    • Replace the phrase <(STRAIN)_ss_HO> with the cell designation.
    • Replace the phrase <(STRAIN)_SS_full> with the cell designation.
    • Copy to the whole column.
  21. In the first cell below the (STRAIN)_p-value header, type =FDIST(<(STRAIN)_Fstat>,5,n-5) replacing the phrase <(STRAIN)_Fstat> with the cell designation and the "n" as in (13) with the number of data points total. (Again, note that the number of timepoints is actually "4" for the dSWI4 strain). Copy to the whole column.
  22. Before we move on to the next step, we will perform a quick sanity check to see if we did all of these computations correctly.
    • Click on cell A1 and click on the Data tab. Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
    • Click on the drop-down arrow on your (STRAIN)_p-value column. Select "Number Filters". In the window that appears, set a criterion that will filter your data so that the p value has to be less than 0.05.
      • CIN5: 2438
      • GLN3: 2255
      • HAP4: 2491
    • Excel will now only display the rows that correspond to data meeting that filtering criterion. A number will appear in the lower left hand corner of the window giving you the number of rows that meet that criterion. We will check our results with each other to make sure that the computations were performed correctly.
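
The F statistic computed in steps 9 through 20 can be written out for a single gene as a short Python sketch. It follows the spreadsheet formulas (SUMSQ about zero for ss_HO, and SUMSQ minus count times the squared average for each timepoint); the function name `within_strain_f` and the example numbers are made up for illustration.

```python
def within_strain_f(groups):
    """Return (ss_H0, SS_full, F) for one gene.

    `groups` maps a timepoint label to the list of log fold changes
    measured at that timepoint (missing replicates are simply left out).
    """
    all_values = [v for values in groups.values() for v in values]
    n = len(all_values)  # total number of data points
    k = len(groups)      # number of timepoints (5, or 4 for dSWI4)

    # ss_HO: sum of squares about zero (no change at any timepoint).
    ss_h0 = sum(v * v for v in all_values)

    # ss_(TIME): SUMSQ(range) - count * average^2, added over timepoints.
    ss_full = 0.0
    for values in groups.values():
        avg = sum(values) / len(values)
        ss_full += sum(v * v for v in values) - len(values) * avg ** 2

    f_stat = ((n - k) / k) * (ss_h0 - ss_full) / ss_full
    return ss_h0, ss_full, f_stat


# Made-up numbers: two timepoints with two replicates each.
example = {"t15": [1.0, 3.0], "t30": [-1.0, 1.0]}
ss_h0, ss_full, f_stat = within_strain_f(example)
```

Because `k` is taken from the number of timepoints actually supplied, the same sketch covers a strain with four timepoints without editing the formula by hand.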

Calculate the Bonferroni p value Correction

  1. Now we will perform adjustments to the p value to correct for the multiple testing problem. Label the next two columns to the right with the same label, (STRAIN)_Bonferroni_p-value.
  2. Type the equation =<(STRAIN)_p-value>*6189 in the first cell below the first header. Upon completion of this single computation, use the Step (7) trick from the Within-Strain ANOVA to copy the formula throughout the column.
  3. Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second (STRAIN)_Bonferroni_p-value header: =IF(r2>1,1,r2), where r2 stands for the cell in the first (STRAIN)_Bonferroni_p-value column. Use the Step (7) trick to copy the formula throughout the column.
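
The two Bonferroni columns amount to multiplying each p value by the number of tests and capping the result at 1. A minimal Python sketch, with the hypothetical function name `bonferroni`, applied to the three unadjusted NSR1 p values recorded in this notebook:

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each p value by the number of tests and cap the result at 1."""
    if n_tests is None:
        n_tests = len(p_values)
    return [min(p * n_tests, 1.0) for p in p_values]


# Unadjusted NSR1 p values for dCIN5, dGLN3, and dHAP4, with 6189 tests.
corrected = bonferroni([6.37625e-08, 0.00050676, 0.016364209], n_tests=6189)
```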

Calculate the Benjamini & Hochberg p value Correction

  1. Insert a new worksheet named "(STRAIN)_ANOVA_B-H".
  2. Copy and paste the "MasterIndex", "ID", and "Standard Name" columns from your previous worksheet into the first three columns of the new worksheet.
  3. For the following, use Paste special > Paste values. Copy your unadjusted p values from your ANOVA worksheet and paste them into Column D.
  4. Select all of columns A, B, C, and D. Sort by ascending values on Column D. Click the sort button from A to Z on the toolbar, in the window that appears, sort by column D, smallest to largest.
  5. Type the header "Rank" in cell E1. We will create a series of numbers in ascending order from 1 to 6189 in this column. This is the p value rank, smallest to largest. Type "1" into cell E2 and "2" into cell E3. Select both cells E2 and E3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189.
  6. Now you can calculate the Benjamini and Hochberg p value correction. Type (STRAIN)_B-H_p-value in cell F1. Type the following formula in cell F2: =(D2*6189)/E2 and press enter. Copy that equation to the entire column.
  7. Type "(STRAIN)_B-H_p-value" into cell G1.
  8. Type the following formula into cell G2: =IF(F2>1,1,F2) and press enter. Copy that equation to the entire column.
  9. Select columns A through G. Now sort them by your MasterIndex in Column A in ascending order.
  10. Copy column G and use Paste special > Paste values to paste it into the next column on the right of your ANOVA sheet.
  • Upload the .xlsx file that you have just created to LionShare. Send Dr. Dahlquist an e-mail with the link to the file (e-mail kdahlquist at lmu dot edu).
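
The sort, rank, and rescale steps of the Benjamini & Hochberg correction can be sketched in Python as below. The sketch mirrors the spreadsheet exactly (p value times the number of tests, divided by rank, capped at 1, then returned in the original order); the function name `benjamini_hochberg` and the four example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """B-H correction as done in the spreadsheet: p * m / rank, capped at 1.

    Results are returned in the original (MasterIndex) order.
    """
    m = len(p_values)
    # Positions sorted by p value, smallest first; ranks start at 1.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(p_values[i] * m / rank, 1.0)
    return adjusted


# Made-up p values for illustration.
example = benjamini_hochberg([0.04, 0.001, 0.03, 0.5])
```

In this example the gene ranked second ends up with a larger adjusted value than the gene ranked third. Statistical packages (for example R's p.adjust with method "BH") add a running-minimum step so that adjusted values never decrease with rank; the spreadsheet procedure does not, so results can differ slightly for some genes.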

Sanity Check: Number of genes significantly changed

Before we move on to further analysis of the data, we want to perform a more extensive sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs.

  • Go to your (STRAIN)_ANOVA worksheet.
  • Select row 1 (the row with your column headers) and select the menu item Data > Filter > Autofilter (The funnel icon on the Data tab). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
  • Click on the drop-down arrow for the unadjusted p value. Set a criterion that will filter your data so that the p value has to be less than 0.05.
    • How many genes have p < 0.05? and what is the percentage (out of 6189)?
    • How many genes have p < 0.01? and what is the percentage (out of 6189)?
    • How many genes have p < 0.001? and what is the percentage (out of 6189)?
    • How many genes have p < 0.0001? and what is the percentage (out of 6189)?
  • When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
  • We have just performed 6189 hypothesis tests. Another way to state what we are seeing with p < 0.05 is that we would expect to see a gene expression change for at least one of the timepoints by chance in about 5% of our tests, or 309 times. Since we have more than 309 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know which ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
    • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)?
    • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)?
  • In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
  • Comparing results with known data: the expression of the gene NSR1 (ID: YGR159C) is known to be induced by cold shock. Find NSR1 in your dataset. What are its unadjusted, Bonferroni-corrected, and B-H-corrected p values? What is its average Log fold change at each of the timepoints in the experiment? Note that the average Log fold change is what we called "(STRAIN)_AvgLogFC_(TIME)" in step 3 of the ANOVA analysis.
  • We will compare the numbers we get between the wild type strain and the other strains studied, organized as a table. Use this sample PowerPoint slide to see how your table should be formatted.
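
The counts and percentages asked for above, and the number of genes expected to pass a cut-off by chance, are simple to compute once the p values are in a list. A small Python sketch follows; the helper names `count_below` and `expected_by_chance` and the eight p values are invented for illustration.

```python
def count_below(p_values, cutoff):
    """Return (count, percentage) of p values strictly below `cutoff`."""
    count = sum(1 for p in p_values if p < cutoff)
    return count, 100.0 * count / len(p_values)


def expected_by_chance(n_tests, cutoff):
    """Number of tests expected to pass the cut-off if nothing changed."""
    return n_tests * cutoff


# Made-up p values standing in for one strain's p-value column.
p_values = [0.2, 0.04, 0.0005, 0.009, 0.7, 0.00002, 0.03, 0.5]
for cutoff in (0.05, 0.01, 0.001, 0.0001):
    count, percent = count_below(p_values, cutoff)
    print(f"p < {cutoff}: {count} ({percent:.2f}%)")
```

For the 6189 tests in this analysis, `expected_by_chance(6189, 0.05)` gives about 309, the figure quoted above.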

Results of Sanity Check

Sanity Check
Criterion              ΔCIN5            ΔGLN3            ΔHAP4
p < 0.05               2438 (39.39%)    2255 (36.44%)    2491 (40.25%)
p < 0.01               1529 (24.71%)    1325 (21.41%)    1591 (25.71%)
p < 0.001              838 (13.54%)     616 (9.95%)      769 (12.43%)
p < 0.0001             458 (7.40%)      259 (4.18%)      308 (4.98%)
Bonferroni p < 0.05    232 (3.75%)      106 (1.71%)      103 (1.66%)
B-H p < 0.05           1683 (27.29%)    1356 (21.91%)    1749 (28.26%)


For NSR1
Measure               ΔCIN5          ΔGLN3         ΔHAP4
unadjusted p-value    6.37625E-08    0.00050676    0.016364209
Bonferroni p-value    0.000394626    1             1
B-H p-value           5.5581E-06     0.00660287    0.055192421
AvgLogFC_t15          4.070025       3.506225      2.69945
AvgLogFC_t30          3.611475       4.5319        3.2508
AvgLogFC_t60          4.2985         2.7592        3.519975
AvgLogFC_t90          -2.900925      -1.85025      -1.1005667
AvgLogFC_t120         -0.9315        -1.867425     -1.797667

9 October 2015

Repeat Within Strain ANOVA, Bonferroni, B-H, and sanity check for ΔSWI4

Within-Strain ANOVA

  1. Create a new worksheet, naming it "(STRAIN)_ANOVA" as appropriate. For example, you might call yours "wt_ANOVA" or "dHAP4_ANOVA".
  2. Copy the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for your strain and paste it into your new worksheet. Copy the columns containing the data for your strain and paste it into your new worksheet.
  3. At the top of the first column to the right of your data, create five column headers of the form (STRAIN)_AvgLogFC_(TIME) where (STRAIN) is your strain designation and (TIME) is 15, 30, etc.
  4. In the cell below the (STRAIN)_AvgLogFC_t15 header, type =AVERAGE(
  5. Then highlight all the data in row 2 associated with (STRAIN) and t15, press the closing paren key (shift 0), and press the "enter" key.
  6. This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
  7. Click on this cell and position your cursor at the bottom right corner. You should see your cursor change to a thin black plus sign (not a chubby white one). When it does, double click, and the formula will magically be copied to the entire column of 6188 other genes.
  8. Repeat steps (4) through (7) with the t30, t60, t90, and the t120 data.
  9. Now in the first empty column to the right of the (STRAIN)_AvgLogFC_t120 calculation, create the column header (STRAIN)_ss_HO.
  10. In the first cell below this header, type =SUMSQ(
  11. Highlight all the LogFC data in row 2 for your (STRAIN) (but not the AvgLogFC), press the closing paren key (shift 0), and press the "enter" key.
  12. In the next empty column to the right of (STRAIN)_ss_HO, create the column headers (STRAIN)_ss_(TIME) as in (3).
  13. Make a note of how many data points you have at each time point for your strain. For most of the strains, it will be 4, but for dHAP4 t90 or t120, it will be "3", and for the wild type it will be "4" or "5". Count carefully. Also, make a note of the total number of data points. Again, for most strains, this will be 20, but for dHAP4, for example, this number will be 18, and for wt it should be 23 (double-check).
    • SWI4 - 16 replicates
      • 15: 0; 30: 4; 60: 4; 90: 4; 120: 4
  14. In the first cell below the header (STRAIN)_ss_t15, type =SUMSQ(<range of cells for logFC_t15>)-<number of data points>*<AvgLogFC_t15>^2 and hit enter.
    • The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
    • The phrase <number of data points> should be replaced by the number of data points for that timepoint (either 3, 4, or 5).
    • The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
    • Upon completion of this single computation, use the Step (7) trick to copy the formula throughout the column.
  15. Repeat this computation for the t30 through t120 data points. Again, be sure to get the data for each time point, type the right number of data points, and get the average from the appropriate cell for each time point, and copy the formula to the whole column for each computation.
  16. In the first column to the right of (STRAIN)_ss_t120, create the column header (STRAIN)_SS_full.
  17. In the first row below this header, type =sum(<range of cells containing "ss" for each timepoint>) and hit enter.
  18. In the next two columns to the right, create the headers (STRAIN)_Fstat and (STRAIN)_p-value.
  19. Recall the number of data points from (13): call that total n.
  20. In the first cell of the (STRAIN)_Fstat column, type =((n-5)/5)*(<(STRAIN)_ss_HO>-<(STRAIN)_SS_full>)/<(STRAIN)_SS_full> and hit enter.
    • Don't actually type the n but instead use the number from (13). Also note that "5" is the number of timepoints and the dSWI4 strain has 4 timepoints (it is missing t15).
    • Replace the phrase <(STRAIN)_ss_HO> with the cell designation.
    • Replace the phrase <(STRAIN)_SS_full> with the cell designation.
    • Copy to the whole column.
  21. In the first cell below the (STRAIN)_p-value header, type =FDIST(<(STRAIN)_Fstat>,5,n-5) replacing the phrase <(STRAIN)_Fstat> with the cell designation and the "n" as in (13) with the number of data points total. (Again, note that the number of timepoints is actually "4" for the dSWI4 strain). Copy to the whole column.
  22. Before we move on to the next step, we will perform a quick sanity check to see if we did all of these computations correctly.
    • Click on cell A1 and click on the Data tab. Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
    • Click on the drop-down arrow on your (STRAIN)_p-value column. Select "Number Filters". In the window that appears, set a criterion that will filter your data so that the p value has to be less than 0.05.
      • ΔSWI4: 2167
    • Excel will now only display the rows that correspond to data meeting that filtering criterion. A number will appear in the lower left hand corner of the window giving you the number of rows that meet that criterion. We will check our results with each other to make sure that the computations were performed correctly.
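
For ΔSWI4 the degrees of freedom change: with 4 timepoints and 16 data points, the F statistic uses (n-4)/4 and the p value corresponds to FDIST(F, 4, 12). The sketch below shows one way to obtain that upper-tail probability in plain Python, using the regularized incomplete beta function evaluated with a standard continued fraction. The function names are hypothetical and the F value in the example is made up; where SciPy is available, `scipy.stats.f.sf` should give the same quantity.

```python
import math


def _clamp(value, tiny=1e-300):
    """Keep a continued-fraction term away from zero before dividing by it."""
    return tiny if abs(value) < tiny else value


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    c = 1.0
    d = 1.0 / _clamp(1.0 - (a + b) * x / (a + 1.0))
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered term of the fraction.
        term = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        d = 1.0 / _clamp(1.0 + term * d)
        c = _clamp(1.0 + term / c)
        h *= d * c
        # Odd-numbered term of the fraction.
        term = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        d = 1.0 / _clamp(1.0 + term * d)
        c = _clamp(1.0 + term / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    # Use whichever form of the fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_upper_tail(f_stat, df1, df2):
    """Right-tail probability of the F distribution, as Excel's FDIST returns."""
    if f_stat <= 0.0:
        return 1.0
    x = df2 / (df2 + df1 * f_stat)
    return reg_inc_beta(df2 / 2.0, df1 / 2.0, x)


# dSWI4: 4 timepoints and 16 data points give df1 = 4 and df2 = 16 - 4 = 12.
p_swi4 = f_upper_tail(5.0, 4, 12)  # 5.0 is a made-up F statistic
```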

14 October 2015 (notes from lab meeting)

  • Assignments for creating Networks:
    • wt: Natalie
    • dHAP4: Grace
    • dCIN5: Kayla
    • dGLN3: Tessa
    • dZAP1: Kristen
  • Instructions for paring down network:
    • First, eliminate genes that are unconnected
    • Then, systematically pare down by p-value, one by one. After each elimination, check for and get rid of unconnected genes. We want a gene list of fewer than 35 transcription factors; the target range for our family of networks is 15 - 35.
    • For now, we will focus on wild type and deletion strains (not Spar)
  • Two types of pare downs:
    • First, just the genes from YEASTRACT
    • Later, potentially adding in our deletion strain genes. Careful with elimination of TFs - deletion strain genes must stay in.
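
The pare-down procedure from the meeting notes can be sketched as a loop. Several details here are assumptions rather than decisions recorded in the notes: genes are removed starting with the largest p value, protected genes (the deletion-strain genes) are kept even if they become unconnected, and any regulatory edge counts as a connection. The names `connected_genes` and `pare_down` and the toy network are hypothetical.

```python
def connected_genes(genes, edges):
    """Genes that take part in at least one edge between two kept genes."""
    linked = set()
    for regulator, target in edges:
        if regulator in genes and target in genes:
            linked.update((regulator, target))
    return linked


def pare_down(genes, edges, p_values, protected, max_size):
    """Drop unconnected genes, then the least significant, until small enough."""
    kept = set(genes)
    while True:
        # Remove unconnected genes; protected genes always stay (assumption).
        kept = {g for g in kept if g in protected} | connected_genes(kept, edges)
        candidates = [g for g in kept if g not in protected]
        if len(kept) <= max_size or not candidates:
            return kept
        # Assumption: eliminate the gene with the largest p value first.
        kept.remove(max(candidates, key=lambda g: p_values[g]))


# Toy network with made-up genes and p values; "E" has no connections.
genes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
p_values = {"A": 0.001, "B": 0.01, "C": 0.02, "D": 0.04, "E": 0.03}
result = pare_down(genes, edges, p_values, protected={"A"}, max_size=3)
```

Re-checking connectivity after every single removal matches the instruction to get rid of unconnected genes after each elimination.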


20 October 2015

Creating a Network using YEASTRACT

Using YEASTRACT to Infer which Transcription Factors Regulate a Cluster of Genes

  1. Filter your within-strain ANOVA results by B-H p-values less than 0.05.
    • Copy the list of gene IDs onto your clipboard.
  2. Launch a web browser and go to the YEASTRACT database.
    • On the left panel of the window, click on the link to Rank by TF.
    • Paste your list of genes from your cluster into the box labeled ORFs/Genes.
    • Check the box for Check for all TFs.
    • Accept the defaults for the Regulations Filter (Documented, DNA binding plus expression evidence)
    • Do not apply a filter for "Filter Documented Regulations by environmental condition".
    • Rank genes by TF using: The % of genes in the list and in YEASTRACT regulated by each TF.
    • Click the Search button.
  3. Answer the following questions:
    • In the results window that appears, the p values colored green are considered "significant", the ones colored yellow are considered "borderline significant" and the ones colored pink are considered "not significant". How many transcription factors are green or "significant"?
    • List the "significant" transcription factors on your wiki page, along with the corresponding "% in user set", "% in YEASTRACT", and "p value".
      • Are CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 on the list?
  4. For the mathematical model that we will build, we need to define a gene regulatory network of transcription factors that regulate other transcription factors. We can use YEASTRACT to assist us with creating the network. We want to generate a network with approximately 15-30 transcription factors in it.
    • You need to select from this list of "significant" transcription factors, which ones you will use to run the model. You will use these transcription factors and add CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 if they are not in your list. Explain in your electronic notebook how you decided on which transcription factors to include. Record the list and your justification in your electronic lab notebook.
    • Go back to the YEASTRACT database and follow the link to Generate Regulation Matrix.
    • Copy and paste the list of transcription factors you identified (plus CIN5, HAP4, GLN3, HMO1, SWI4, and ZAP1) into both the "Transcription factors" field and the "Target ORF/Genes" field.
    • We are going to generate several regulation matrices, with different "Regulations Filter" options.
      • For the first one, accept the defaults: "Documented", "DNA binding plus expression evidence"
      • Click the "Generate" button.
      • In the results window that appears, click on the link to the "Regulation matrix (Semicolon Separated Values (CSV) file)" that appears and save it to your Desktop. Rename this file with a meaningful name so that you can distinguish it from the other files you will generate.
      • Repeat these steps to generate a second regulation matrix, this time applying the Regulations Filter "Documented", "Only DNA binding evidence".
      • Repeat these steps a third time to generate a third regulation matrix, this time applying the Regulations Filter "Documented", "DNA binding and expression evidence".

Visualizing Your Gene Regulatory Networks with GRNsight

We will analyze the regulatory matrix files you generated above in Microsoft Excel and visualize them using GRNsight to determine which one will be appropriate to pursue further in the modeling.

  1. First we need to properly format the output files from YEASTRACT. You will repeat these steps for each of the three files you generated above.
    • Open the file in Excel. It will not open properly in Excel because a semicolon was used as the column delimiter instead of a comma. To fix this, select the entire Column A. Then go to the "Data" tab and select "Text to columns". In the Wizard that appears, select "Delimited" and click "Next". In the next window, select "Semicolon", and click "Next". In the next window, leave the data format at "General", and click "Finish". This should now look like a table with the names of the transcription factors across the top and down the first column and all of the zeros and ones distributed throughout the rows and columns. This is called an "adjacency matrix." If there is a "1" in the cell, that means there is a connection between the transcription factor in that row and the one in that column.
    • Save this file in Microsoft Excel workbook format (.xlsx).
    • Check to see that all of the transcription factors in the matrix are connected to at least one of the other transcription factors by making sure that there is at least one "1" in a row or column for that transcription factor. If a factor is not connected to any other factor, delete its row and column from the matrix. Make sure that you still have somewhere between 15 and 30 transcription factors in your network after this pruning.
      • Only delete the transcription factor if there are all zeros in its column AND all zeros in its row. You may find visualizing the matrix in GRNsight (below) can help you find these easily.
    • For this adjacency matrix to be usable in GRNmap (the modeling software) and GRNsight (the visualization software), we need to transpose the matrix. Insert a new worksheet into your Excel file and name it "network". Go back to the previous sheet and select the entire matrix and copy it. Go to your new worksheet and click on the A1 cell in the upper left. Select "Paste special" from the "Home" tab. In the window that appears, check the box for "Transpose". This will paste your data with the columns transposed to rows and vice versa. This is necessary because we want the transcription factors that are the "regulatORS" across the top and the "regulatEES" along the side.
    • The labels for the genes in the columns and rows need to match. Thus, delete the "p" from each of the gene names in the columns. Adjust the case of the labels to make them all upper case.
    • In cell A1, copy and paste the text "rows genes affected/cols genes controlling".
  2. Now we will visualize what these gene regulatory networks look like with the GRNsight software.
    • Go to the GRNsight home page (you can either use the version on the home page or the beta version).
    • Select the menu item File > Open and select one of the regulation matrix .xlsx files that has the "network" worksheet in it that you formatted above. If the file has been formatted properly, GRNsight should automatically create a graph of your network. Move the nodes (genes) around until you get a layout that you like and take a screenshot of the results. Paste it into your PowerPoint presentation. Repeat with the other two regulation matrix files. You will want to arrange the genes in the same order for each screenshot so that the graphs can be easily compared.
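
The spreadsheet steps above (finding unconnected factors, transposing, and fixing the labels) can also be sketched in code. This is a minimal Python sketch, not part of the original protocol. It assumes the downloaded matrix has regulators down the side with a trailing "p" on each label and targets across the top; the function names and the toy three-gene matrix (with made-up edges) are hypothetical.

```python
HEADER = "rows genes affected/cols genes controlling"

def clean_regulator_label(label):
    """Drop the trailing protein 'p' (e.g. 'Cin5p' -> 'CIN5') and upper-case."""
    label = label.strip()
    if label.endswith("p"):
        label = label[:-1]
    return label.upper()

def unconnected_genes(regulators, targets, adj):
    """Genes whose row AND column are all zeros.

    adj[i][j] == 1 means regulators[i] regulates targets[j]; labels must
    already be cleaned so the same gene has the same name in both lists.
    """
    out_degree = {r: sum(row) for r, row in zip(regulators, adj)}
    in_degree = {t: sum(col) for t, col in zip(targets, zip(*adj))}
    genes = list(dict.fromkeys(list(regulators) + list(targets)))
    return [g for g in genes
            if out_degree.get(g, 0) == 0 and in_degree.get(g, 0) == 0]

def network_sheet(regulators, targets, adj):
    """Transpose so regulators run across the top and targets down the side."""
    transposed = [list(col) for col in zip(*adj)]
    sheet = [[HEADER] + list(regulators)]
    for target, row in zip(targets, transposed):
        sheet.append([target] + row)
    return sheet

# Toy matrix with made-up edges: CIN5 regulates GLN3, HAP4 is isolated.
regulators = [clean_regulator_label(x) for x in ["Cin5p", "Gln3p", "Hap4p"]]
targets = [x.upper() for x in ["CIN5", "GLN3", "HAP4"]]
adj = [[0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]]
print(unconnected_genes(regulators, targets, adj))
for row in network_sheet(regulators, targets, adj):
    print(row)
```

A gene is reported as unconnected only when both its row and its column are all zeros, matching the rule for deleting a transcription factor.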

Looking at the Network

  • Genes added to the network: Δcin5, Δgln3, Δhap4, Δhmo1, Δswi4
  • There are no genes that are unconnected to the network.
  • Network has 64 genes
  • Bonferroni p-value < 0.05: 6
    • TEC1
    • HFI1
    • HMO1
    • SSN2
    • SUT2
    • MIG2
  • unadjusted p-value < 0.05: 27
    • TEC1
    • HFI1
    • HMO1
    • SSN2
    • SUT2
    • MIG2
    • SAS5
    • IXR1
    • CIN5
    • BAS1
    • YAP6
    • SOK2
    • MCM1
    • ZAP1
    • RLM1
    • YHP1
    • RDS1
    • HSF1
    • SPT20
    • ABF1
    • SFP1
    • HAP4
    • RIF1
    • YLR278C
    • INO4
    • ROX1
    • MSN4
  • B-H p-value < 0.05: 19
    • TEC1
    • HFI1
    • HMO1
    • SSN2
    • SUT2
    • MIG2
    • SAS5
    • IXR1
    • CIN5
    • BAS1
    • YAP6
    • SOK2
    • MCM1
    • ZAP1
    • RLM1
    • YHP1
    • RDS1
    • HSF1
    • SPT20
  • Network with just p-value < 0.05 wouldn't load on GRNsight (too many edges)
  • Created network with just B-H p-value < 0.05
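
The three gene lists above are nested because each correction is stricter than the last: Bonferroni is the most conservative, Benjamini-Hochberg (B-H) less so, and the unadjusted p-value the least. The sketch below shows the standard textbook forms of both adjustments in Python. It is an illustration only: the p-values are made up, and it is not necessarily how the adjusted p-values recorded in this notebook were computed.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Standard B-H step-up adjustment, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def count_significant(values, cutoff=0.05):
    return sum(1 for v in values if v < cutoff)

# Made-up p-values for five hypothetical genes.
p = [0.001, 0.008, 0.02, 0.045, 0.3]
print(count_significant(p))                      # unadjusted
print(count_significant(benjamini_hochberg(p)))  # B-H
print(count_significant(bonferroni(p)))          # Bonferroni
```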

27 October 2015

  • From yeastract: there were 58 significant genes
    • Δcin5, Δgln3, Δhap4, Δhmo1, and Δswi4 were added to the network
  • Generate Regulation Matrix for "Only DNA binding evidence"
    • In dropbox folder: RegulationMatrix_dGLN3_ONLY_20151027.xlsx
  • There are no unconnected genes in the network.
  • Network has 63 genes and 332 edges RegulationMatrix_dGLN3_ONLY_20151027
  • Delete transcription factors with a yeastract p-value in the E-05 range
    • ROX1, YFL052W, RDS1, CAC2, YAP6, CBF1, SSN2, INO4, FLO8, HFI1
    • Are not regulated: ABF1, OPI1
    • Do not regulate: CYC8, FLO11, CSE2, SNF5, BAS1, RIF1, SNF2, SNF6, MGA2, SPT20, IXR1, SUT2, OPI1, KAR4, ARR1, YLR278C, CST6, ASG1, SAS5
  • OPI1 has no connections - delete
    • 52 genes 238 edges RegulationMatrix_dGLN3_ONLY-2_20151027
  • Delete transcription factors with a yeastract p-value in the E-07 range
    • SOK2, IXR1, SUT2, SPT23, TUP1, RLM1, KAR4, ADR1, ARR1, AFT1, RAP1, YLR278C, CST6, ASG1, MIG2, SAS5
    • Are not regulated: ZAP1, ABF1, BAS1, SPT20
    • Do not regulate: CYC8, FLO11, MSN4, CSE2, SNF5, BAS1, RIF1, SNF2, SNF6, MGA2, SPT20
  • BAS1 and SPT20 have no connections - delete
  • Network has 35 genes and 120 edges RegulationMatrix_dGLN3_ONLY-3_20151027
  • Delete transcription factors with a yeastract p-value in the E-08 range
    • SKO1, GCN4, CRZ1, MGA2, STE12, HSF1
    • Are not regulated: GCR2, ZAP1, CSE2, ABF1
    • Do not regulate: CYC8, FLO11, MSN4, CSE2, SNF5, RIF1, SNF2, SNF6, GLN3
  • CSE2 has no connections - delete
    • 27 genes 79 edges RegulationMatrix_dGLN3_ONLY-4_20151027
      • GLN3 is only controlled by YHP1 (also by GCN4, AFT1, RAP1 in first network). YHP1 is controlled by MSN2, TEC1, FKH2, ASH1, MCM1, CIN5, SWI4.
  • Keep the 10 most significant genes from yeastract and CIN5, GLN3, HAP4, HMO1, SWI4 for RegulationMatrix_dGLN3_ONLY-4_20151027
      • 15 genes 28 edges
      • ACE2 and ZAP1 are only connected to each other when this is done. ASH1 is necessary for keeping ZAP1 connected to the network.
  • RegulationMatrix_dGLN3_ONLY-6_20151027 16 genes 32 edges (add ASH1)
  • RegulationMatrix_dGLN3_ONLY-7_20151027 delete all genes with p-values greater than 6.0759E-11
    • CSE2 has no connections - delete
    • 22 nodes and 60 edges

28 October 2015

  • Do one network regardless of the deletion strains and one with the deletion strains added.
  • Create a network of 35 transcription factors (for both)
    • Delete gene by gene
    • Record number of "regardless" and "added deletion"
  • Regardless
    • took top 35 - SPT23 is unconnected; delete that and take the gene with the next-greater p-value --> RegulationMatrix_dGLN3_ONLY-1_regard_20151027
    • RegulationMatrix_dGLN3_ONLY-2_regard_20151027
      • BAS1 and SPT20 are unconnected so they are deleted
      • 32 genes 97 edges
    • RegulationMatrix_dGLN3_ONLY-3_regard_20151027
      • 31 nodes 92 edges
    • RegulationMatrix_dGLN3_ONLY-4_regard_20151027
      • 30 nodes 85 edges
    • RegulationMatrix_dGLN3_ONLY-5_regard_20151027
      • 29 nodes 71 edges
    • RegulationMatrix_dGLN3_ONLY-6_regard_20151027
      • RIF1 and CSE2 are unconnected so they are deleted
      • 26 nodes 67 edges
    • RegulationMatrix_dGLN3_ONLY-7_regard_20151027
      • 25 nodes 52 edges
    • RegulationMatrix_dGLN3_ONLY-8_regard_20151027
      • 24 nodes 51 edges
    • RegulationMatrix_dGLN3_ONLY-9_regard_20151027
      • 23 nodes 48 edges
    • RegulationMatrix_dGLN3_ONLY-10_regard_20151027
      • 22 nodes 46 edges
    • RegulationMatrix_dGLN3_ONLY-11_regard_20151027
      • 21 nodes 41 edges
    • RegulationMatrix_dGLN3_ONLY-12_regard_20151027
      • 20 nodes 39 edges
    • RegulationMatrix_dGLN3_ONLY-13_regard_20151027
      • 19 nodes 38 edges
    • RegulationMatrix_dGLN3_ONLY-14_regard_20151027
      • 18 nodes 32 edges
    • RegulationMatrix_dGLN3_ONLY-15_regard_20151027
      • 17 nodes 28 edges
    • RegulationMatrix_dGLN3_ONLY-16_regard_20151027
      • 16 nodes 24 edges
    • RegulationMatrix_dGLN3_ONLY-17_regard_20151027
      • ACE2 and ZAP1 are unconnected so they are deleted
      • 13 nodes 20 edges
  • Adding transcription factors
    • took top 35 - SPT23 is unconnected; delete that and take the gene with the next-greater p-value --> RegulationMatrix_dGLN3_ONLY-1_add_20151027
    • RegulationMatrix_dGLN3_ONLY-2_add_20151027
      • Start deleting after the added transcription factors
      • BAS1 and CSE2 are unconnected so they are deleted
      • 32 genes 115 edges
    • RegulationMatrix_dGLN3_ONLY-3_add_20151027
      • 31 genes 97 edges
    • RegulationMatrix_dGLN3_ONLY-4_add_20151027
      • 30 genes 95 edges
    • RegulationMatrix_dGLN3_ONLY-5_add_20151027
      • 29 genes 92 edges
    • RegulationMatrix_dGLN3_ONLY-6_add_20151027
      • 28 genes 86 edges
    • RegulationMatrix_dGLN3_ONLY-7_add_20151027
      • 27 genes 79 edges
    • RegulationMatrix_dGLN3_ONLY-8_add_20151027
      • 26 genes 76 edges
    • RegulationMatrix_dGLN3_ONLY-9_add_20151027
      • 25 genes 75 edges
    • RegulationMatrix_dGLN3_ONLY-10_add_20151027
      • 24 genes 74 edges
    • RegulationMatrix_dGLN3_ONLY-11_add_20151027
      • 23 genes 66 edges
    • RegulationMatrix_dGLN3_ONLY-12_add_20151027
      • 22 genes 60 edges
    • RegulationMatrix_dGLN3_ONLY-13_add_20151027
      • 21 genes 56 edges
    • RegulationMatrix_dGLN3_ONLY-14_add_20151027
      • ACE2 and ZAP1 are no longer connected to the network - delete
      • 18 genes 50 edges
    • RegulationMatrix_dGLN3_ONLY-15_add_20151027
      • 17 genes 44 edges
    • RegulationMatrix_dGLN3_ONLY-16_add_20151027
      • 16 genes 43 edges
    • RegulationMatrix_dGLN3_ONLY-17_add_20151027
      • 15 genes 33 edges
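
The gene-by-gene deletion recorded above can be sketched as a loop: drop the least significant gene that is not protected, remove anything no longer connected to the main network, and record the node and edge counts. This is a hedged Python sketch, not the procedure actually used (which was done by hand in Excel and GRNsight). It assumes "unconnected" means "not in the largest connected component", which is how the ACE2/ZAP1 notes read. The gene names, edges, and p-values are toy values, and `protected` stands in for the added deletion-strain transcription factors.

```python
def largest_component(genes, edges):
    """Largest weakly connected component (edge direction ignored)."""
    neighbors = {g: set() for g in genes}
    for regulator, target in edges:
        neighbors[regulator].add(target)
        neighbors[target].add(regulator)
    seen, best = set(), set()
    for start in genes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            g = stack.pop()
            if g in component:
                continue
            component.add(g)
            stack.extend(neighbors[g] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def prune_step(genes, edges, pvalues, protected=frozenset()):
    """Delete the least significant unprotected gene, then drop stragglers."""
    candidates = [g for g in genes if g not in protected]
    if not candidates:
        return list(genes), set(edges)
    worst = max(candidates, key=lambda g: pvalues[g])
    genes = [g for g in genes if g != worst]
    edges = {(r, t) for r, t in edges if worst not in (r, t)}
    keep = largest_component(genes, edges)
    genes = [g for g in genes if g in keep]
    edges = {(r, t) for r, t in edges if r in keep and t in keep}
    return genes, edges

# Toy network: (regulator, target) pairs with made-up p-values.
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
edges = {("G1", "G3"), ("G3", "G2"), ("G6", "G5"), ("G6", "G3"), ("G4", "G5")}
pvalues = {"G1": 1e-12, "G2": 1e-11, "G3": 1e-10,
           "G4": 1e-9, "G5": 1e-8, "G6": 1e-7}
genes, edges = prune_step(genes, edges, pvalues, protected={"G1", "G2"})
print(len(genes), "genes", len(edges), "edges")
```

In the toy run, deleting G6 strands G4 and G5 as a pair connected only to each other, so they are removed in the same step.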

18 November 2015

  • Tasks:
    • Create input sheets
    • Begin with the largest. Ask Kayla, Natalie, or Grace to show us how to use Access
    • Put in expression data for all strains except SWI4
  • Creating input sheets:
    1. production_rates sheet:
      • Two columns "id", "production_rate".
    2. degradation_rates sheet:
      • Two columns "id", "degradation_rate".
    3. Expression Data Sheets for Individual Yeast Strains
      • Each strain will have its own sheet in the workbook.
      • Each sheet should be given a unique name that follows the convention "STRAIN_log2_expression", where the word "STRAIN" is replaced by the strain designation, which will appear in the optimization_diagnostics sheet. The sheet should have the following columns in this order:
        • "id": list of all genes.
        • expression data for each gene at a given timepoint given as log2 ratios (log2 fold changes) with the column headers being the time at which the data were collected, without any units for the 15, 30, and 60 minutes of cold shock.
        • Replace any #VALUE! entries so that those cells are empty
    4. network sheet
      • Add in networks. Ask about using Access.
    5. network_weights sheet: identical to the network sheet.
    6. optimization_parameters sheet
      1. alpha: Penalty term weighting (from the L-curve analysis)
      2. kk_max: Number of times to re-run the optimization loop. In some cases re-starting the optimization loop can improve performance of the estimation.
      3. MaxIter: Number of times MATLAB iterates through the optimization scheme. If this is set too low, MATLAB will stop before the parameters are optimized.
      4. TolFun: How different two least squares evaluations should be before the program determines that it is not making any improvement
      5. MaxFunEval: maximum number of times the program will evaluate the least squares cost
      6. TolX: How close successive parameter estimates should be before the program determines that it is not making any improvement.
      7. Sigmoid: =1 if sigmoidal model, =0 if Michaelis-Menten model
      8. estimate_params: =1 to estimate parameters; =0 to do just one forward run
      9. make_graphs: =1 to output graphs; =0 to not output graphs
      10. fix_P: =1 if the user does not want to estimate the production rate parameter, P (the initial guess is used and never changed); =0 to estimate
      11. fix_b: =1 if the user does not want to estimate the b parameter (the initial guess is used and never changed); =0 to estimate
      12. expression_timepoints: A row containing a list of the time points when the data was collected experimentally. Should correspond to the timepoint column headers in the STRAIN_log2_expression sheets.
      13. Strain: A row containing a list of all of the strains for which there is expression data in the workbook. Should correspond to the "STRAIN" portion of the names of the STRAIN_log2_expression sheets for each strain.
      14. simulation_timepoints: A row containing a list of the time points at which to evaluate the differential equations to generate the simulated data. This does not need to correspond to the actual measurement times, but should be in the same units (e.g. minutes).
    7. threshold_b sheet
      • Two columns: the first, "id", lists the standard names for the genes in the model in the same order as in the other sheets. The second column should contain the initial guesses, typically all 0.
  • Save Excel as Input_Sheet_Skeleton_TM_2015-11-18.xlsx
  • Find out procedure for using Microsoft Access

Notes for Microsoft Access

  • A database is a glorified spreadsheet program. It allows for persistent storage of data and can look like Excel, but it has a more complex structure and behaviors.
  • Editing a database table saves it automatically. Do not edit data directly. Be careful.
  • When working with database program, it automatically saves file. Makes a default filename and location.
  • Gene IDs become primary key. For our database to work properly, you need to have a primary key.
  • Database tables conceptually break the data down (similar to tabs in Excel).
  • Use database to write a query. Can relate table by primary key system. Mathematical formulation.
  • Use database to pull list of data out for us in a simple query.
  1. Open Microsoft Access, choose a blank database, and select a name & location.
  2. External Data
    • Import side: Excel and choose normalized data
    • When prompted, choose Rounded_Normalized_Data and keep all defaults until getting to the primary key.
    • Choose the ID field as the primary key
    • Name of table should be Rounded_Normalized_Data
      • Do not need to save import steps
      • Import error: NAs were turned into space characters, must go back into Excel and delete spaces.
        • Go back to Master_Sheet and turn space into nothing and try again
  3. Double click Master_Sheet
  4. Make another Table - copy and paste in genes from networks
  5. Go Create > Query Design, add both tables
    • Match fields that have the same names - drag ID to ID
    • Right click on line -> Join properties (choose option 2 "Include all records from Table 1 and only...")
    • Choose what data you want to bring
  6. Click "Make Table" and Run
    • Table Name "Query_Results"
  7. Can do two things
    1. Copy and Paste into Excel
    2. External Data -> Export to Excel
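
The query design above amounts to an outer join of two tables on the ID key. The sketch below reproduces the idea with Python's built-in sqlite3 module rather than Access. The table name Network_Genes, the column names t15/t30/t60, and the gene IDs are hypothetical, and the network gene list is assumed to be the table whose records are all kept (join option 2).

```python
import sqlite3

def query_results(expression_rows, network_ids):
    """Return expression data for every network gene, None where missing."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Rounded_Normalized_Data "
        "(ID TEXT PRIMARY KEY, t15 REAL, t30 REAL, t60 REAL)"
    )
    con.execute("CREATE TABLE Network_Genes (ID TEXT PRIMARY KEY)")
    con.executemany(
        "INSERT INTO Rounded_Normalized_Data VALUES (?, ?, ?, ?)",
        expression_rows,
    )
    con.executemany(
        "INSERT INTO Network_Genes VALUES (?)",
        [(gene_id,) for gene_id in network_ids],
    )
    # LEFT JOIN keeps all rows from Network_Genes and only matching data rows.
    rows = con.execute(
        "SELECT n.ID, d.t15, d.t30, d.t60 "
        "FROM Network_Genes AS n "
        "LEFT JOIN Rounded_Normalized_Data AS d ON n.ID = d.ID "
        "ORDER BY n.ID"
    ).fetchall()
    con.close()
    return rows

# Toy data: GENE_C is in the network but has no expression record.
data = [("GENE_A", 0.1234, -0.5, None), ("GENE_B", 1.0, 0.2, 0.3)]
for row in query_results(data, ["GENE_A", "GENE_C"]):
    print(row)
```

Genes with no match come back with empty fields instead of being dropped, which makes missing data easy to spot before building the input sheet.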

30 November 2015

7 December 2015

Format Input Sheets:

  • Delete genes that have no data in any of the fields.
    • Network was reduced to 14 genes:
  • Leave production_rates and degradation_rates sheets blank
  • Optimization parameters should be formatted:
    • alpha should be 0.01
    • kk_max should be 1
    • MaxIter should be 1e08 (one hundred million in plain English)
    • TolFun should be 1e-6
    • MaxFunEval should be 1e08 (one hundred million in plain English)
    • TolX should be 1e-6
    • Sigmoid should be 1
    • estimateParams should be 1
    • makeGraphs should be 1
    • fix_P should be 0
    • fix_b should be 1
    • For the parameter "time" (Cell A13), we should have "15", "30", and "60", since these are the timepoints we have in our data.
    • For the parameter "Strain" (Cell A14), replace "dcin5" with the name of the second strain you are using, making sure that the capitalization and spelling are the same as what you named the worksheet containing that strain's expression data. We are only going to compare two strains, so you can delete the other strain information.
    • For the parameter "Sheet" (Cell A15), give the number of the worksheet from left to right that your "Strain" log2 expression data is in. Delete any extra numbers because we are only comparing two strains.
    • For the parameter "Deletion", leave the zero in cell B15 (corresponding to wt). In cell C15, put a number corresponding to the position in the list of gene names that the gene that was deleted appears. In the sample file, CIN5 is number 3 in the list. Note, disregard the column header in this count and only consider the actual gene names themselves.
    • For the parameter "simtime", the forward simulation of the expression is performed in five-minute increments from 0 to 60 minutes. Thus, this row should read: 0, 5, <...fill by steps of 5...>, 60, each number in a different cell.
  • For "network_b" paste in the list of standard names for your transcription factors with "rows genes affected/cols genes controlling" as the column header. The "threshold" value for each gene should be "0".
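
The parameter values listed above can be collected in one place to check them before typing them into the worksheet. This is a minimal Python sketch. The gene list, strain names, and sheet numbers in the example are hypothetical; only the parameter names and values come from the list above.

```python
def simulation_timepoints(start=0, stop=60, step=5):
    """0, 5, ..., 60: one value per cell in the simtime row."""
    return list(range(start, stop + 1, step))

def deletion_index(gene_names, deleted_gene):
    """1-based position of the deleted gene in the gene list; 0 for wt."""
    if deleted_gene is None:
        return 0
    return gene_names.index(deleted_gene) + 1

def optimization_parameters(gene_names, strains):
    """strains: list of (strain_name, deleted_gene_or_None, sheet_number)."""
    return {
        "alpha": 0.01, "kk_max": 1,
        "MaxIter": 1e8, "TolFun": 1e-6, "MaxFunEval": 1e8, "TolX": 1e-6,
        "Sigmoid": 1, "estimateParams": 1, "makeGraphs": 1,
        "fix_P": 0, "fix_b": 1,
        "time": [15, 30, 60],
        "Strain": [name for name, _, _ in strains],
        "Sheet": [sheet for _, _, sheet in strains],
        "Deletion": [deletion_index(gene_names, gene) for _, gene, _ in strains],
        "simtime": simulation_timepoints(),
    }

# Hypothetical gene list in which CIN5 is third; sheet numbers are made up.
gene_names = ["ACE2", "ASH1", "CIN5", "GLN3"]
params = optimization_parameters(
    gene_names, [("wt", None, 3), ("dcin5", "CIN5", 4)]
)
print(params["Deletion"])
print(params["simtime"])
```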

Input Sheet

Running the model

  • Download master branch of GRNmap from github and save to desktop (7-zip -> Extract here).
  • Open MATLAB and navigate to the "GRNmap-master" folder and open the "matlab" subfolder. Double-click on the file "GRNmodel.m"
  • Click on the green triangle "Run" button to run the model.
  • When prompted, select the input file.
  • Running the code in MATLAB gave the following error:

Index exceeds matrix dimensions.
Error in readInputSheet (line 166)
   log2FC(i).deletion  = Deletion(i);
Error in GRNmodel (line 30)
GRNstruct = readInputSheet(GRNstruct);
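
The error is raised while indexing Deletion(i), which would be consistent with the "Deletion" row having fewer entries than there are strains. That diagnosis is an assumption and is not confirmed in these notes. Under that assumption, a small pre-check like the Python sketch below could catch the mismatch before the model is run. The function name and the dictionary layout are hypothetical.

```python
def check_strain_rows(params):
    """Return a list of problems; an empty list means the rows line up."""
    n_strains = len(params.get("Strain", []))
    problems = []
    for key in ("Sheet", "Deletion"):
        values = params.get(key, [])
        if len(values) != n_strains:
            problems.append(
                f"{key} has {len(values)} entries but Strain has {n_strains}"
            )
    return problems

# Hypothetical input sheet in which one Deletion cell was left out.
params = {"Strain": ["wt", "dgln3"], "Sheet": [3, 4], "Deletion": [0]}
for problem in check_strain_rows(params):
    print(problem)
```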

15 January 2016

Notes from Lab Meeting:

  • Task is to format the input sheets.
  • Alphabetize the genes, making use of the transpose feature.
  • Make sure all genes are complete in terms of degradation rates and production rates.
  • Put in the average for missing values in the log2 expression
    • Make sure they are colored, and paste special values so the cells hold values rather than formulas.

20 January 2016

  1. Alphabetize genes:
    • For log2 expression data: Select data, press the Sort button, then Custom Sort.
    • For the network, alphabetize, then transpose, alphabetize, and transpose again.
  2. Finish setting up the input sheet.
    • Paste your list of transcription factors from your "network" sheet into the column named "StandardName". You will need to look up the "SystematicName" of your genes. YEASTRACT has a feature that will allow you to paste your list of standard names in to retrieve the systematic names here.
    • The degradation rates have been calculated from protein half-life data from a paper by Belle et al. (2006). Look up the rates for your transcription factors from this file and include them in your "degradation_rates" worksheet.
    • The production rates are calculated by multiplying the degradation rates by 2.
    • Guidelines for the optimization parameters sheet are here
  3. Plug in averages for missing data
    • Select the data, under Format, select Conditional Formatting, and set it to highlight blanks in yellow.
    • In the empty cells type =AVERAGE() and highlight the data for the other data points for that time point.
    • The highlight will disappear as soon as there is anything in the cells, so make sure to rehighlight.
    • Copy all of the data and paste special values.
    • Reselect the data, right click, and set the number format to show 4 decimal places.
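
The fill-in and production-rate steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Excel procedure itself: replicates for one gene at one time point are given as a list with None for blanks, and the function names are hypothetical. The returned positions play the role of the yellow highlight.

```python
def fill_missing_with_mean(replicates):
    """Replace None with the mean of the other replicates at that time point.

    Returns (filled_values, positions_that_were_filled). If every replicate
    is missing there is nothing to average, so the values are left as-is.
    """
    present = [v for v in replicates if v is not None]
    if not present:
        return list(replicates), []
    mean = sum(present) / len(present)
    filled_positions = [i for i, v in enumerate(replicates) if v is None]
    return [mean if v is None else v for v in replicates], filled_positions

def production_rate(degradation_rate):
    """Production rate is taken as twice the degradation rate."""
    return 2 * degradation_rate

# Made-up log2 ratios for four replicates at one time point.
values, filled = fill_missing_with_mean([0.5, None, 1.0, 1.5])
print([f"{v:.4f}" for v in values], filled)
print(production_rate(0.25))
```

Formatting with four decimal places only affects display here, as in Excel; the stored values are not rounded.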


I had the following questions:

  1. If there isn't any data in any of the four replicates for one of the time points, should we delete that gene? For example, if there isn't any data for any of the four 15 time points for a gene in one of the log2 expression sheets, should we delete that gene for the network?
  2. For my network where we did not add in CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1, my network only had ZAP1 and did not have GLN3 which is the deletion strain I am focusing on. Should I add in GLN3, or focus my attention to the network where CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 were added.

21 January 2016

Dr. Dahlquist answered my questions on github:

  1. If there isn't any data in any of the four replicates for one of the time points, should we delete that gene? For example, if there isn't any data for any of the four 15 time points for a gene in one of the log2 expression sheets, should we delete that gene for the network?
    • Answer: you should either delete that gene, or alternately, you could consider deleting that strain's expression data and keeping that gene (you could try both).
  2. For my network where we did not add in CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1, my network only had ZAP1 and did not have GLN3 which is the deletion strain I am focusing on. Should I add in GLN3, or focus my attention to the network where CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 were added.
    • Answer: don't add GLN3 to that family of networks if it was not there originally; that is what the other family is for. Only use the strain data that are appropriate for that particular network. For example, if HAP4 is not part of the network, then you shouldn't use the expression data for that strain.


There are now three different networks to pare down:

  1. dGLN3_35-gene_120-edges_add_input_TM_2016-01-20_delete-genes
    • Delete CRZ1 and FLO11
    • Version 5 is identical to 4 (originally had CRZ1 deleted)
    • Version 14 removes ZAP1 from the network so delete the log2 expression data for ZAP1
  2. dGLN3_35-gene_120-edges_add_input_TM_2016-01-21_delete-expression-data
    • Delete the log2 expression data for HMO1 and HAP4
    • Version 14 removes ZAP1 from the network so delete the log2 expression data for ZAP1
  3. dGLN3_35-gene_120-edges_disregard_input_TM_2015-01-21
    • Version 17 removes ZAP1 from the network so delete the log2 expression data for ZAP1

22 January 2016

Week 2 Meeting Notes:

  • Coders still have work to do, but it should be finalized next week so we can run models on the beta branch.
  • Run the model

29 January 2016

Week 3 Meeting Notes:

  • TRACE documentation - ongoing
  • Format input sheets
  • Actually run L-curve analysis
  • Do four runs: largest and smallest network ± the deletions (likely 6 for me).
  • LSE is on y-axis and penalty on x-axis. This information is on the optimization_diagnostics sheet in the output sheet.
  • Label alpha for each point - tells us which alpha to choose.
  • Pull data out of the Excel file and then plot it in Excel.
  • Make it so make_graphs equals 1 for the L-curve analysis.
  • Coders are going to do a release on master soon, so we should use master after this week.
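
For the L-curve, each model run contributes one point: penalty on the x-axis, LSE on the y-axis, labeled with its alpha. The Python sketch below only arranges such points into a small table ready to paste into Excel for plotting. The numbers are made up, and the dictionary keys are hypothetical rather than the actual labels on the optimization_diagnostics sheet.

```python
import csv
import io

def lcurve_table(runs):
    """runs: iterable of dicts with keys 'alpha', 'penalty', and 'LSE'.

    Returns CSV text sorted by alpha, one row per model run.
    """
    rows = sorted(runs, key=lambda run: run["alpha"])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["alpha", "penalty", "LSE"])
    for run in rows:
        writer.writerow([run["alpha"], run["penalty"], run["LSE"]])
    return buffer.getvalue()

# Made-up values for two runs.
runs = [
    {"alpha": 0.1, "penalty": 2.0, "LSE": 1.5},
    {"alpha": 0.01, "penalty": 5.0, "LSE": 1.0},
]
print(lcurve_table(runs))
```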

1 February 2016

3 February 2016

8 April 2016

Over the following weeks compile:

  • LSE/Min LSE table
  • LSE vs total parameters plot
  • In- and Out- Degree distribution
  • Figure out common genes

15 April 2016

  • Created LSE/Min LSE table, LSE vs total parameters plot, and In- and Out- Degree distribution
  • To figure out common genes, Grace, Kristen, and I created a Google Sheet to compare
  • Need to email Dr. D about the conference.
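
An in- and out-degree distribution can be read directly off the network sheet. The Python sketch below assumes the "rows genes affected/cols genes controlling" layout, so a row sum is a gene's in-degree and a column sum is its out-degree. The function name and the toy matrix are hypothetical.

```python
from collections import Counter

def degree_distributions(genes, matrix):
    """matrix[i][j] == 1 means the gene in column j regulates the gene in row i.

    Returns two Counters mapping degree -> number of genes with that degree.
    """
    in_degree = {g: sum(1 for v in row if v)
                 for g, row in zip(genes, matrix)}
    out_degree = {g: sum(1 for v in col if v)
                  for g, col in zip(genes, zip(*matrix))}
    return Counter(in_degree.values()), Counter(out_degree.values())

# Toy network with made-up edges: G1 regulates both G2 and G3.
genes = ["G1", "G2", "G3"]
matrix = [[0, 0, 0],
          [1, 0, 0],
          [1, 0, 0]]
in_dist, out_dist = degree_distributions(genes, matrix)
print(dict(in_dist))
print(dict(out_dist))
```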
