This journal entry is due on Tuesday, April 19 at midnight PDT (Monday night/Tuesday morning). NOTE new due date and that the server records the time as Eastern Daylight Time (EDT). Therefore, midnight will register as 03:00.
Individual Journal Assignment
- Store this journal entry as "username Week 13" (i.e., this is the text to place between the square brackets when you link to this page).
- Create the following set of links. (HINT: you can do all of this easily by adding them to your template and then using the template on your pages.)
- Link to your journal entry from your user page.
- Link back from your journal entry to your user page.
- Link to this assignment from your journal entry.
- Don't forget to add the "BIOL398-01/S11" category to the end of your wiki page.
We have done a great deal of statistical data analysis to arrive at this point: the construction of a dynamic model of gene regulation. So far, we have
- Processed microarray data to identify genes responding to the experimental condition
- Grouped genes into similarly acting clusters.
- Identified transcription factors involved in the activation and repression of those clusters.
- Created a network defining the affectors and affectees in this regulation process.
These steps, as you will recall from the last week's work, involve a lot of computation, and a lot of "sub steps," some of which was done "by hand" in excel (or matlab), and some of which was done through software packages. We have one more package with which to work, one that a team of people (faculty and students) has put together. You will find it on Dr. Fitzpatricks lionshare as code_2011_04_18_biomath_modeling.zip.
When you unzip this file, you will find a number of matlab functions, a script called estimation_driver.m, and an example input excel file. The functions we will leave alone, but the driver script and excel file we will edit and change.
The math model we've discussed in class requires a few things.
- Initial states of the transcription factor levels.
- Production and degradation rate constants.
- The interaction network, defined by the adjacency matrix of affector and affectee genes.
- The weights of the affector gene levels on the affectee.
- The thresholds at which the switches turn "on" and "off."
The problem of determining the weights and thresholds is a (nonlinear) regression problem that requires some experimental data to which we compare the model. Thus, we also need some measurement data.
- Times at which observations are made.
- Log 2 fold change information to be used as gene expression levels.
- Log 2 fold change error bars.
We will make these requirements into specific tasks below.
Structure of the Excel file for Gene Regulatory Network Parameter Identification
The excel file must have the following sheets
The format of these sheets must be (like the input for STEM) exactly like that of the example excel file.
In addition, there are some sheets you can include if desired.
- optimization parameters
Building Your Own Input File (i.e., your homework assignment)
- Important: you must be consistent in listing your genes. The rows of the sheets of the degradation rates and network matrices need to have the genes in the same order and with the same case (lower, upper). Please take care to organize your genes consistently across input sheets!
- Start with log2_concentrations. Fill in the data for the genes you've identified as being in your network. This may be a tedious bit of cut-and-paste. The data you need is the average log2 fold change for each gene of interest and each time point. You may find it easiest to use your previous (large) workbook from Week 11, adding a copy of your final sheet and editing that down to the genes of interest.
- The log2_concentrations sheet has a header row. Columns A and B contain the gene names, the systematic name (e.g., YDR259C) followed by the standard name (e.g., YAP6). Entries in column C and beyond of that header row are the times of the data collection. The format is that the rows represent genes while the columns represent the time course. For the Schade data, use only the short term data (not the 12 hour and longer). For the Dahlquist Lab data, use only the cold shock and not the recovery.
- Now, let's tackle the concentration_sigmas. This sheet follows the same format of the log2_concentrations sheet, except that the data we enter are the standard deviations of the log2 fold changes for each gene at each time point. We will extract this data from the Week 11 sheet as well.
- Next, move to the network sheet. The rows are the affectees, while the columns are affectors. This format is the transpose of the output from your YEASTRACT data. Copy the YEASTRACT network, but when you paste it into your input excel workbook, use "paste transpose" (or "paste special" selecting "transpose" depending on your excel version). This sheet also requires a header row and a header column, in which we label the genes by their standard names (e.g., YAP6, GLN3).
- The network_weights sheet is an initial guess, as we discussed in class. Thus, we put our best shot in there. Lacking other information, we can make them all ones for starters.
- The network_thresholds sheet is also an initial guess, as we discussed in class. Thus, we put our best shot in there. Lacking other information, we can make them all ones for starters.
- Degradation rates are harder. My lionshare account also includes an excel workbook Belle_PNAS_06_SuppDataSet_with_abs.xls. This file contains degradation rates for a number of the proteins translated from the transcription factors. The data is in the sheet named ranked-and-averaged. Find your genes there, and convert half-life to rate (lambda = LN(2)/half_life). This data must be entered into the spreadsheet degradation_rates, which also has two columns of gene identifiers, just like the log2_concentrations sheet, as well as a header row.
- The production_rates are actually determined inside the software. You should delete the production_rates sheet if you have one in your input file.
- Save this workbook in the same folder as the matlab files.
- Share this file with your instructors on lionshare. Due to the sensitive nature of the unpublished data, take care not to leave data files in public places such as openwetware or lab computers.
Running the Parameter Identification software (i.e., part 2 of your homework assignment)
- open matlab and change directories to your working directory.
- edit the input_file_name line to include the excel file you've created, using the full path.
- run the script. it will take awhile.
- the script creates an excel file with estimated weights and thresholds. Examine these values. Which genes activate? Which repress? Are any particularly strong or weak? Examine the threshold values. How do you interpret these results?
Copy and paste the resulting plots that matlab generates to a powerpoint, and upload that powerpoint to your individual assignment page.
- Store your journal entry in the shared Class Journal Week 13 page. If this page does not exist yet, go ahead and create it (congratulations on getting in first :) )
- Link to your journal entry from your user page.
- Link back from the journal entry to your user page.
- Sign your portion of the journal with the standard wiki signature shortcut (
- Add the "BIOL398-01/S11" category to the end of the wiki page (if someone has not already done so).
- What aspect of this assignment came most easily to you?
- What aspect of this assignment was the most challenging for you?
- What (yet) do you not understand?
- Are there any aspects of the model structure with which you don't feel comfortable? Explain.