Natalie Williams Week 10

From OpenWetWare
Revision as of 23:22, 23 March 2015 by Natalie Williams (talk | contribs) (→‎Further Questions: Comment on data analysis process)

Outline of Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae

Introduction

  • Gene expression converts the genetic information in DNA sequences into proteins and/or functional RNAs.
    • Promoter regions must be recognized by transcriptional regulatory proteins, which help RNA polymerase bind to the DNA strand.
  • Advances in microarray technology have made it easier to follow changes in a cell's gene expression over time.
    • By analyzing these microarray data, we can better understand the relationships between genes and their transcription factor regulators.
    • Because these relationships collectively form a network among the genes, it should be possible to construct networks by studying the results of microarray data.
  • Budding yeast, Saccharomyces cerevisiae, has been studied extensively in the lab.
    • Its genome is well characterized.
    • Expression data were collected and analyzed to determine which genes were expressed at specific stages of the cell cycle.
    • Genes were grouped based on where their regulators bound to promoter regions.
  • Methods previously used to produce networks:
    • A generalized linear model was created to describe regulators and predict the pattern of regulators and their target genes.
    • A kinetic model with Bayesian networks was used to predict gene regulatory networks as well as the proteins that regulate gene expression.
    • Another method predicted networks by combining genomic information with gene expression data.
      • Other researchers extended this method by using promoter regions or the sigma factor.
  • An alternative method used in this paper:
    • A model based on nonlinear differential equations was used.
      • It requires a pool of all potential regulators.
      • Genes from the pool of potential regulators are picked, and the model is applied to try to fit the gene expression results of the target genes.
      • This is done for all potential regulators.
  • In this model:
    • There were 40 target genes;
    • 184 possible regulators were identified;
    • The data were analyzed using a linear model; and,
    • Results from the linear model were compared with those of the nonlinear differential equation system to see how well each predicted the target genes' profiles.

Results

Dynamic model of transcriptional control

  • The model assumes repeated interactions between regulators and target genes over time.
    • The model also assumes combinatorial control of target genes by the regulators.

Equation 1

  • yj: expression level of regulator j
  • wj: regulatory weight of regulator j
  • g: regulatory effect on a specific gene
  • j = 1, 2, ..., m, where m is the number of regulators controlling the gene
  • b: parameter for transcription initiation delay/unspecific bias caused by regulator effects associated with gene expression
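The equation image is not reproduced on this page. Based only on the symbol definitions above, Equation 1 plausibly takes the form of a weighted sum of the regulator expression levels plus the bias term. This is a reconstruction from the listed symbols, not a quotation from the paper:

```latex
g = \sum_{j=1}^{m} w_j \, y_j + b
```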

The rate of expression of the target gene (dz/dt) is given by the regulatory effect of other genes, ρ, minus the effect of degradation, x.
Equation 2

  • Degradation is modeled as a first-order chemical reaction: x = k*z
  • ρ = regulatory effect g of the regulators, transformed by a sigmoidal transfer function

The entire model for control of target gene expression z:
Equation 3

  • k2: rate of degradation of target gene product
  • k1: rate of expression
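As a rough illustration of how the pieces above fit together, the sketch below integrates a model of the assumed form dz/dt = k1 * sigmoid(g) - k2 * z. The logistic function as the sigmoidal transfer, the regulator profile, and all parameter values are assumptions chosen for illustration, and SciPy's `solve_ivp` with `method="RK45"` stands in for a Runge-Kutta solver such as MATLAB's ode45.

```python
import numpy as np
from scipy.integrate import solve_ivp

def target_rate(t, z, regulators, w, b, k1, k2):
    """Assumed form of the model: dz/dt = k1 * sigmoid(g) - k2 * z."""
    y = np.array([reg(t) for reg in regulators])  # regulator levels y_j(t)
    g = np.dot(w, y) + b                          # combined regulatory effect
    rho = 1.0 / (1.0 + np.exp(-g))                # logistic transfer (assumption)
    return k1 * rho - k2 * z                      # synthesis minus first-order degradation

# Hypothetical single regulator with an oscillating profile.
regulators = [lambda t: 1.0 + np.sin(t)]
w, b, k1, k2 = np.array([2.0]), -1.0, 1.0, 0.5
times = np.linspace(0.0, 10.0, 18)

sol = solve_ivp(target_rate, (0.0, 10.0), [0.0], t_eval=times,
                args=(regulators, w, b, k1, k2), method="RK45")
z = sol.y[0]  # simulated target expression at the 18 time points
```

Because the sigmoid output lies between 0 and 1, the simulated expression starting from zero stays below k1/k2, which is a quick sanity check on any fitted parameters.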

However, Equation 3 can be simplified to Equation 4
Equation 4

  • y is approximated with a polynomial of degree n

Approximation of y

  • Coefficients were taken from experimental gene expression data using a least squares approximation.
  • The errors at all points were assumed to have equal weight.
  • The simplified version - Equation 4 - was used to identify regulators of the target genes.
  • The degree n has to be chosen so that the polynomial captures the major changes in gene expression in each individual experiment.
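A least squares polynomial approximation of a regulator profile can be sketched with NumPy as follows. The profile values, the missing point, and the degree are made up for illustration; `fit_profile` is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def fit_profile(times, values, degree):
    """Least squares polynomial fit with equal weights, skipping missing (NaN) points."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    mask = ~np.isnan(values)
    coeffs = np.polyfit(times[mask], values[mask], degree)
    return np.poly1d(coeffs)  # callable polynomial y(t)

# Hypothetical regulator profile at 18 time points with one missing measurement.
times = np.arange(18, dtype=float)
values = 1.0 + 0.5 * np.sin(2.0 * np.pi * times / 18.0)
values[4] = np.nan

y_hat = fit_profile(times, values, degree=5)
estimate_at_gap = y_hat(4.0)  # the smooth fit also supplies a value at the missing point
```

A fitted polynomial is continuous in time, so it can be evaluated at any point an ODE solver requests, not only at the measured time points.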

The expression profiles Z = {z(t)} for the target gene and Y = {y(t)} for the regulating genes, measured at time points t = 1, 2, ..., Q, were used to fit the model by minimizing the mean squared error.
Equation 6

  • {z^c(t)}: computed (model) profile of z(t) at time points t = 1, 2, ..., Q, calculated from Equation 4
  • Q: number of data points
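Read as an average squared error over the Q time points, the error measure would be E = (1/Q) * sum over t of (z^c(t) - z(t))^2. The short sketch below computes it for made-up numbers; the function name is hypothetical.

```python
import numpy as np

def mean_squared_error(z_model, z_data):
    """Average squared difference between computed and measured profiles."""
    z_model = np.asarray(z_model, dtype=float)
    z_data = np.asarray(z_data, dtype=float)
    return np.mean((z_model - z_data) ** 2)

# Hypothetical three-point example: only the last point differs, by 2.
z_data = [1.0, 2.0, 3.0]
z_model = [1.0, 2.0, 5.0]
E = mean_squared_error(z_model, z_data)  # (0 + 0 + 4) / 3
```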

The problem then becomes finding the parameters that give the best fit with minimal error.
The linear model was then compared to the nonlinear model.

  • The parameters (d) came from the minimization of the error in Equation 6.

Computational algorithm

  • Regulators are chosen from the pool of 184 potential regulators to predict the profiles of the target genes.
    • Equations 4 and 6 were used
  • Missing experimental data are handled by the polynomial fit of degree n, where n is chosen according to the number of data points and the extent of expression change.

The algorithm used is as follows:

  1. Fit the regulator profiles with Equation 5
  2. Select a target gene
  3. Select a potential regulator gene
  4. Apply the least squares minimization to the target and regulator genes
  5. Repeat steps 3 and 4 for all potential regulators
  6. Pick the regulators that best fit the target gene
  7. Repeat from step 2 until all target genes have been processed
  • The above algorithm was run 100 times for each regulator-target pair.
  • Optimization was based on the Levenberg-Marquardt method, and Equation 4 was solved with ode45 in MATLAB.
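The search loop above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the single-regulator model form, the log-scale parameterization of the rates, the use of the first data point as the initial condition, the regulator names, and the synthetic data are all hypothetical. SciPy's `least_squares(..., method="lm")` provides Levenberg-Marquardt and `solve_ivp(..., method="RK45")` stands in for ode45. Only 3 restarts are used here to keep the example fast.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares
from scipy.special import expit  # numerically stable logistic sigmoid

def simulate_target(params, y_poly, times, z0):
    """Integrate the assumed model dz/dt = k1 * sigmoid(w*y(t) + b) - k2*z."""
    w, b, log_k1, log_k2 = params
    # Rates are fitted on a log scale and clipped so they stay positive and bounded.
    k1, k2 = np.exp(np.clip([log_k1, log_k2], -5.0, 5.0))
    rhs = lambda t, z: k1 * expit(w * y_poly(t) + b) - k2 * z
    sol = solve_ivp(rhs, (times[0], times[-1]), [z0], t_eval=times, method="RK45")
    if not sol.success or sol.y.shape[1] != len(times):
        return np.full(len(times), 1e3)  # penalize failed integrations
    return sol.y[0]

def fit_pair(z_data, y_poly, times, n_starts, rng):
    """Multi-start Levenberg-Marquardt fit; returns (lowest error, its parameters)."""
    best_err, best_params = np.inf, None
    for _ in range(n_starts):
        x0 = rng.normal(0.0, 1.0, size=4)  # random initial parameter values
        res = least_squares(
            lambda p: simulate_target(p, y_poly, times, z_data[0]) - z_data,
            x0, method="lm", max_nfev=200)
        err = np.mean(res.fun ** 2)  # mean squared error of this fit
        if err < best_err:
            best_err, best_params = err, res.x
    return best_err, best_params

def rank_regulators(z_data, regulator_profiles, times, degree=5, n_starts=3, seed=0):
    """Fit every candidate regulator to one target and sort by error."""
    rng = np.random.default_rng(seed)
    results = []
    for name, y_data in regulator_profiles.items():
        y_poly = np.poly1d(np.polyfit(times, y_data, degree))  # step 1: polynomial fit
        err, params = fit_pair(z_data, y_poly, times, n_starts, rng)  # steps 3-4
        results.append((err, name, params))
    return sorted(results, key=lambda r: r[0])  # step 6: best fits first

# Synthetic demonstration with three hypothetical regulators.
times = np.arange(18, dtype=float)
profiles = {
    "REG_A": 1.0 + np.sin(2.0 * np.pi * times / 18.0),
    "REG_B": 1.0 + np.cos(2.0 * np.pi * times / 18.0),
    "REG_C": np.linspace(0.5, 1.5, 18),
}
true_poly = np.poly1d(np.polyfit(times, profiles["REG_A"], 5))
z_data = simulate_target([2.0, -1.0, 0.0, np.log(0.5)], true_poly, times, z0=0.5)
ranking = rank_regulators(z_data, profiles, times)
```

Each entry of `ranking` is an `(error, name, parameters)` tuple, so the sign of the fitted weight (the first parameter) can be inspected for the best-fitting candidates. Repeating this over all target genes would correspond to step 7.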

Dataset selection

To validate their model, Vu and Vohradsky compared their results to microarray data from Spellman.

  • 6178 open reading frames were on the chip.
  • The number of regulators influencing the cell cycle was much smaller.

The 184 possible regulators were extracted from YEASTRACT and other published papers.
The 40 target genes were selected from Chen et al's work.

Inference of regulators

The data were in the form of log base 2 ratios of the actual mRNA value to the value of a standard. Before analysis, each expression value was squared and then underwent the least squares minimization procedure for all target-regulator pairs.
Equation 8 is used to approximate the unknown real expression pattern for each gene due to potential error in the execution and results of the experiment as well as the nature of the biological processes that occur.

  • Through multiple measurements and the use of polynomial fit, a statistical model can predict the overall error.
    • However, the polynomial fit was still used.

Equation 9 gives the deviation of the model from the data. Minimizing the error (Eq 6) and computing the target’s expression as a response to its regulator’s profile (Eq 4) helps identify the best regulator-target pairs.

  • Pairs for target and regulator were chosen based on their error being less than the deviation obtained from Eq 9.
  • In the plots, the smaller E’s (squared errors) signify that the regulator fits the profile of the target gene better than other regulators.
  • The selected pairs with the best-fitting profiles were compared with the YEASTRACT database.
  • If there was a match, then the pair was labeled as correct.

Table 1 shows the target genes and which regulator fit an individual target best as well as the errors associated with the pair.

  • In Table 1, 35% of the regulator-target relationships were described as correct and best-fitted.

The average false positive rate was low.

  • ’min(m)’ shows that most targets’ regulators were identified within the first 5 tests done from the regulator pool; 90% of targets had their regulators identified within the first 10 tests from the pool.
  • The false positive rate was computed as the ratio of the number of regulators identified as false positives to the total number of potential regulators (184).
  • The false positives in this paper are defined as those regulator-target pairs that were not found in the YEASTRACT database.
  • The false positives were centered on a select few target genes because their profiles were easily influenced by many, almost all, regulators.

The specificity of prediction is Sp = (N - FP)/N where N is the number of potential regulators and FP is false positives.
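As a small worked example of this formula, the sketch below evaluates Sp for the pool of 184 potential regulators. The false positive count of 10 is a made-up number for illustration, not a value from the paper.

```python
def specificity(n_regulators, false_positives):
    """Sp = (N - FP) / N for one target gene."""
    return (n_regulators - false_positives) / n_regulators

# N = 184 comes from the regulator pool; FP = 10 is hypothetical.
sp = specificity(184, 10)  # 174 / 184, roughly 0.946
```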

  • This measure indicates how many experiments would be needed to verify the algorithm's predictions.

Regulators can either activate or repress the target gene depending on the sign of the weight (+ activates while - represses).

  • When the algorithm’s predictions were compared to YEASTRACT, it correctly identified a regulator as an activator or repressor around 75% of the time.

There were many sources of error, including:

  • YEASTRACT not having all of the information regarding genes and their relationship to other genes
  • Experimental noise
  • Least squares minimization, which may converge to a local optimum instead of the best overall parameter values
    • To counter this, the procedure was run 100 times with arbitrary initial values, and the parameters that gave the best profile were selected.

Comparison with linear model

Linear Model shown in Eq 7

  • Table 1 shows results from the linear model
  • Figure 2 shows histograms comparing, for the nonlinear and linear models, how many regulators were tested before the first correct one was found (min(m))
  • In Table 1, the error function shows that the fit is one degree better for the nonlinear model than the linear
  • Best fit from the nonlinear algorithm was also compared to YEASTRACT as well as another paper by Chen et al.
    • No matches occurred between the predictions of Chen et al and only 5/40 were correct for the nonlinear model.

Discussion

  • An algorithm (nonlinear model) that assumed that the target gene's profile is a result of a specific regulator was used to help model the cell cycle.
  • The observed, experimental data was compared to the output of the model.
    • This difference was minimized with the least squares function (Eq 6).
  • The pairs whose model output best matched the data were chosen; the algorithm correctly identified around 40% of the pairs, as verified with YEASTRACT.
  • This model takes into account all the possible connections between target genes and their potential regulators.
    • To run at a larger scale than 40 targets and 184 potential regulators, the algorithm would need to be parallelized.
  • The nonlinear model predicts and fits the data better than the linear model.
  • Comparing all three results (Chen et al, nonlinear, and linear), different connections and data sets were produced from each of them.
  • This study focused on a simplified case of gene regulatory networks.
    • Because it is simplified, the data used to build this network came from previous studies; however, since not all information about particular target and regulatory genes is available, some false inferences could have been drawn from what was observed.
    • The study also did not include the target genes themselves in the regulatory pool, which could have skewed the results. Some target genes may regulate other target genes or themselves.
  • Models cannot be applied to a whole organism under all conditions, only to specific environments or isolated responses.
  • The algorithm constructed for this paper can be used to infer the transcriptional network and gene regulation in other organisms.

Conclusions

  • The focus of this study was to describe and understand the relationship between the target genes and their regulators and to understand the basic transcriptional regulation of those genes.
    • The model can correctly identify whether regulators are activators or repressors.
    • It also helps determine the strength of the effect regulators have on their target genes.

Further Questions

  • The main result of this article is that the alternative model could accurately and correctly identify pairs of target genes and their regulators. The algorithm could also distinguish between activating and repressing regulators.
  • This work is important because it shows that there are many methods for determining the connectivity and relationships between genes and their regulators. It also suggests that under various conditions, different pathways will have different effects on the cell as well as on the results seen from the various models. This article highlights that although individual studies can be done and those researchers can verify their data, a compilation of all the data benefits and enlightens us, the science community, about the processes and mechanisms set up within yeast cells.
  • The methods and the statistical/mathematical analysis used are described in the outline above.
    • I did not see direct relationships between some of the questions regarding the methods, such as how the cells were treated or what the control was; therefore, I did not answer those questions.
    • 40 target genes were studied with 184 potential regulators. There was no overlap between the two pools of genes, so the relationship between two target genes was not analyzed or considered in the experiment.
    • Least squares minimization was used to help fit and standardize the data.
    • The fit for each target-regulator pair was repeated 100 times with varying initial values to obtain different parameters that would help with finding the optimized fit.

Figures and Tables

Table 1 lists in numerical order all 40 target genes.

  • The next column, best, lists the best regulators for those target genes. The best regulators had the lowest E (squared error) values.
  • In the column after best, the number of regulators increases due to the loosening of constraints on the E value. This occurs for the next two columns where E1 is multiplied by 1.1 and then 1.2.
  • The two columns that state min(m) list the position of the first correctly identified regulators with E in increasing order. The two columns compare the nonlinear model to the linear model.
  • The last two columns list the lowest E values achieved from the nonlinear model and then the linear model.

Figure 1 has two sections - A and B, where A has the graphs for the repressors and B the activators.

  • X-axis: the 18 time points at which the data was taken/model was simulated
  • Y-axis: the expression relative to time point zero
  • The names are in the order target/regulator
  • The symbols, x's, represent the target gene profile from the data
  • The dotted line represents the reconstructed or simulated output profile of that target gene
  • The solid line is the profile of the regulator that best fit the data for the target gene
  • The figures show the best fitting regulator for its specified target gene.
    • Repressors have curves that are inverted relative to those of their target genes. This makes sense because when a gene's repressor is high, the expression level of that target gene is expected to fall.
    • Activators have the same curve shape as their target genes. When an activator sits on the DNA strand longer, it promotes more transcription of that specific mRNA to make more of its product.

Figure 2 is a histogram of the order of correctly identified regulators in the sorted list and min(m).

  • A has the results from the nonlinear model
  • B has the observations recorded from the linear model
  • Comparing the two histograms, one can see that the nonlinear model better predicted the regulators of genes with a smaller pool. Its highest value fell somewhere between 19 and 22.
  • For B, the highest min(m) value was 35 which means that for the linear model more runs had to be done to correctly match a target gene and its regulator from the model.

Vocabulary

  1. Co-ordinately controlled genes: genes whose expressions can be turned on or off due to the same stressor or signal Source
  2. Singular value decomposition: a factorization of a data matrix into singular values and singular vectors, which describes the data in terms of scores for each individual measurement and loadings; closely related to the eigenvalues and eigenvectors of a matrix Source
  3. Putative: describes something that is generally accepted or inferred without direct proof Source
  4. Nonlinear differential equation: a differential equation that is not linear in the unknown function and its derivatives Source
    • I completely forgot what the actual definition of this term/equation was, but reading it, I realize that it was very simple.
  5. Recursive: describes a procedure defined in terms of itself, where each result is obtained by applying the same operation to a previous result Source
  6. Reconstructed profile: an overall prediction of what a specific gene's expression levels were over a period of time; reconstructed means that the algorithm modeled the profile from the data that were input Source
  7. Levenberg-Marquardt procedure: finds the minimum of a function that is a sum of squares of nonlinear functions Source
  8. Specificity of prediction: the proportion of the true negative outcomes that are correctly predicted to be negative Source
  9. Parallelized: adapted so that a program can run simultaneously on multiple processors Source
  10. Proteomic: the study of proteins synthesized by cells or certain organisms Source


Back to User Page: User:Natalie Williams
To view the Course and Assignments:BIOL398-04/S15