Nika Vafadari Week 10

Electronic Lab Notebook Week 10

Purpose

To analyze the journal article by Vu and Vohradsky, which uses a nonlinear model to identify the regulators of target genes associated with the yeast cell cycle and to determine whether each regulator functions as an activator or a repressor.

Preparation for Journal Club 2

Journal Club Presentation Slides

Biological/ Mathematical Terms

Putative: generally accepted or assumed to be true Source
Proteomic: relating to the entire complement of proteins of an organism or cell Source
Nonlinear differential equation: a differential equation that is not linear in the unknown function and its derivatives, for example because a term is raised to a power other than one Source
Specificity of prediction: the proportion of actual negatives that are correctly predicted to be negatives Source
Levenberg-Marquardt method: a standard method for solving nonlinear least squares problems by minimizing the sum of the squared errors Source
Parallelized: adapted so that a program can run on a parallel processing system Source
Gene Regulatory Network: a set of genes that work together to control a particular function in a cell Source
Algorithm: the complete procedure, usually a set of formulas or steps applied in a specific order, used to solve a problem Source
Co-ordinately Compared Genes: eukaryotic genes that each have their own promoter and control elements. These genes do not need to be next to each other, but they share the same control elements Source
Histogram: a graphical representation that shows the shape or underlying distribution of a continuous set of data Source

Outline

Main Result Presented in Paper

Goal: use an approximation of the expression profile of a target gene to pick a set of potential regulators for that specific target gene
identify true regulators by determining which ones best model the profile of the target gene
check for correct identification of regulators by comparing results to the YEASTRACT database
used a nonlinear model as opposed to a linear one

Found that the nonlinear model correctly identifies the regulators of target genes associated with the cell cycle in yeast and determines their function as activator or repressor
compared to the linear model, the nonlinear model gives better results in terms of correct identification of the regulator and a better fit to the gene expression profile of the target

(further discussed in discussion section below)

Significance of Work

Introduction

Regulation of gene expression

DNA (gene) —> RNA molecules and functional proteins
For regulation to occur → transcriptional regulatory proteins must recognize the promoter sequence in order to bind RNA polymerase and start transcription
reaction to environmental stress & cell development can lead to modification/change in gene expression

DNA microarrays

visualize and record change in gene expression over time (duration of cell cycle/developmental processes) by looking at the change in mRNA levels
to understand relationship between regulators/ target genes and the network interactions that result
these networks are what lead to changes in the amount of RNA/ changes in gene expression

Cell cycle control

well studied in Saccharomyces cerevisiae leading to the establishment of large transcriptomic databases that include the changes in RNA synthesis throughout the cell cycle
Goal: collect microarray gene expression data (genome wide) pertaining to cell cycle in yeast—> analyze through clustering methods —> identify cell cycle controlled genes
based on assumption that the expression profile of a specific target gene is the result of action of upstream regulator (transcriptional regulator)

Methods

Previous methods

grouped genes based on which promoter the transcriptional regulators bind to instead of similarity in pattern of gene expression
identified potential networks through the use of differential equations in order to develop a generalized linear model to predict the pattern of transcription of a specific gene
- Wolf and Wang: used fuzzy logic
- Nachman et al.: used dynamic Bayesian networks with a kinetic model
- Bar-Joseph: used gene expression analysis and genomic info alongside one another
- Wang et al. and Makita et al.: extending the work of Bar-Joseph, incorporated promoter sequence analysis into gene expression analysis

Alternative method presented in paper

replaces linear model with model using nonlinear differential equation
- Procedure applied
  - starts by selecting set of potential regulators —> 184 chosen
  - select set of specific target genes within S. cerevisiae —> 40 genes selected
  - select genes from within the set of potential regulators to apply model to in order to see if it fits the gene expression profile of the specified gene correctly
  - repeat for selected target genes and potential regulators
  - determine true regulators by identifying regulators that model the profile of the target gene correctly

Results

Dynamic model of transcriptional control

Model Assumptions
- the relationship/ interaction between regulators and target genes is repeated over time
- combinatorial control by regulators exists/causes change in gene expression of a target gene

Equation 1

represents target gene expression level z at time t + dt
g= regulatory effect for specific gene
yj= expression levels of regulators at time t
wj= regulatory weights of genes controlling target gene
j= 1, 2,…,m ( m= number of regulators controlling gene)
b= parameter that represents delay of transcription initiation/ unspecified bias from regulatory effects of gene expression

Equation 2

rate of expression of a target gene (dz/dt)
p= regulatory effect g of regulators j transformed by sigmoidal transfer function (regulatory effects of other genes)
x= effect of degradation (x = k·z)

Equation 3

models control of target gene expression z
k2= rate constant of degradation of product of target gene
k1= maximal rate of expression
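
The definitions for equations 1 to 3 can be put together in a short Python sketch. The equation itself is not reproduced in these notes, so the form below is an assumption reconstructed from the variable definitions: the regulatory effect is taken as the weighted sum of regulator levels plus b, the sigmoidal transfer function is assumed to be the logistic function, and the function names are made up for illustration.

```python
import math


def regulatory_effect(weights, regulator_levels, bias):
    """Weighted sum of regulator expression levels y_j plus the bias b."""
    return sum(w * y for w, y in zip(weights, regulator_levels)) + bias


def sigmoid(g):
    """Logistic transfer function (assumed), mapping g into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-g))


def target_rate(z, weights, regulator_levels, bias, k1, k2):
    """Assumed form of equation 3: dz/dt = k1 * sigmoid(g) - k2 * z."""
    g = regulatory_effect(weights, regulator_levels, bias)
    return k1 * sigmoid(g) - k2 * z
```

With this form, k1 caps the production term (the sigmoid never exceeds 1) and k2 scales the degradation term with the current level z.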

Equation 4

simplified version of equation 3 that only considers one transcriptional factor
y= approximated with a polynomial of degree n (shown below in equation 5)

Equation 5

used a least squares minimization procedure to calculate the coefficients {a0, ..., an} from the experimental gene expression profile
assumed the same weight of error for all measured points
degree n= must be chosen individually for each experiment, based on the number of data points in the profile and the experimental fluctuation, so that the polynomial reflects the rate of change in gene expression within that experiment (for the yeast cell cycle, n=6 was chosen)
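
A minimal sketch of the polynomial fit described for equation 5, assuming equal error weights for all points as stated above. It solves the normal equations directly in plain Python. This is not how the paper's MATLAB code is necessarily written, and the function names are hypothetical. For a degree of 6 with raw time values, rescaling time (for example to the range 0 to 1) before fitting would likely be needed to keep the system well conditioned.

```python
def polyfit_least_squares(times, values, degree):
    """Return coefficients [a0, ..., an] minimizing the sum of squared errors."""
    n = degree + 1
    # Normal equations (V^T V) a = V^T y, where V[i][j] = times[i] ** j.
    ata = [[sum(t ** (r + c) for t in times) for c in range(n)] for r in range(n)]
    aty = [sum((t ** r) * v for t, v in zip(times, values)) for r in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def polyval(coeffs, t):
    """Evaluate a0 + a1*t + ... + an*t**n using Horner's rule."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * t + a
    return result
```

For example, `polyfit_least_squares([0, 1, 2, 3, 4], [1, 6, 17, 34, 57], 2)` recovers coefficients close to 1, 2 and 3, since those values were generated from 1 + 2t + 3t².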

Equation 6

to identify the gene profiles that minimize the mean/average square error function
Z= z(t) = expression profiles of target gene at time points t for Q data points (1,2,…,Q)
Y= y(t)= expression profiles of regulator gene at time points t for Q data points
Q= number of data points (data points calculated by equation 4 model)
z^c(t)= reconstructed profiles of Z at time points t for Q data points
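
The error function can be sketched as follows, assuming equation 6 is the mean of the squared differences between the measured target profile z(t) and the reconstructed profile z^c(t) over the Q time points. The exact normalization is an assumption based on the "mean/average square error" wording above.

```python
def mean_square_error(observed, reconstructed):
    """E = (1/Q) * sum over the Q time points of (z(t) - z_c(t)) ** 2."""
    if len(observed) != len(reconstructed):
        raise ValueError("profiles must have the same number of time points")
    q = len(observed)
    return sum((z - zc) ** 2 for z, zc in zip(observed, reconstructed)) / q
```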

Equation 7

In order to minimize error function (equation 6), data was entered into both nonlinear model and linear model to compare the calculated parameters (w, b, k1, k2) represented by d

Computational algorithm

Goal: use the approximation/ estimation of expression profile of target gene to pick set of potential regulators for a specific target gene
potential regulators for specific target gene were chosen from pool of 184 transcriptional regulators using equation 4/ least square minimization and minimizing error function (equation 6)
estimated regulator gene profiles using polynomial of degree n to avoid both missing data points and fluctuating gene expression profiles due to experimentation
solved equation 4 numerically using the ode45 function in MATLAB (Runge–Kutta procedure) and optimized parameters with a least squares minimization loop (using the standard Levenberg–Marquardt procedure)
repeated 100 times for each regulator-target combo with random values for parameters at start of optimization
results compared to linear model using equation 7 in place of equation 4
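
The numerical solution step can be illustrated with a fixed-step fourth-order Runge-Kutta integrator in Python. This is a simplified stand-in for MATLAB's adaptive ode45, not the paper's code, and it assumes the single-regulator form dz/dt = k1 / (1 + exp(-(w·y(t) + b))) - k2·z for equation 4. The `regulator` argument is any function of time, such as the fitted polynomial.

```python
import math


def simulate_target(z0, regulator, w, b, k1, k2, t_end, steps):
    """Integrate the assumed equation 4 from t = 0 to t_end with classic RK4."""

    def rate(t, z):
        return k1 / (1.0 + math.exp(-(w * regulator(t) + b))) - k2 * z

    h = t_end / steps
    t, z = 0.0, z0
    profile = [z]
    for _ in range(steps):
        s1 = rate(t, z)
        s2 = rate(t + h / 2, z + h * s1 / 2)
        s3 = rate(t + h / 2, z + h * s2 / 2)
        s4 = rate(t + h, z + h * s3)
        z += h * (s1 + 2 * s2 + 2 * s3 + s4) / 6
        t += h
        profile.append(z)
    return profile
```

With a regulator that is always zero, b = 0, k1 = 2 and k2 = 1, the equation reduces to dz/dt = 1 - z, whose exact solution 1 - exp(-t) gives a quick check on the integrator. The parameter optimization (Levenberg–Marquardt with 100 random restarts) is not shown here.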

Overall algorithm:

1. use equation 5 to fit regulator genes with polynomial of degree n
2. choose specific target gene
3. choose possible regulator from large pool of potential ones
4. use least squares minimization (equation 4) on target gene/ regulators & error function (equation 6)
5. repeat starting at step 3 for all potential regulators
6. choose best fit regulators that match criteria
7. repeat for all the target genes starting at step 2
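
The looping structure of the overall algorithm can be sketched in Python as below. The per-pair model fit is passed in as `fit_error`, a hypothetical callable that stands in for the least squares fit of one regulator-target pair and returns its error E. The function and argument names are illustrative and not taken from the paper.

```python
def rank_regulators(target_profile, regulator_profiles, fit_error):
    """Return (regulator name, E) pairs sorted from smallest to largest E.

    fit_error(target, regulator) is a placeholder for fitting the model to
    one regulator-target pair and returning the resulting error E.
    """
    scored = [
        (name, fit_error(target_profile, profile))
        for name, profile in regulator_profiles.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


def infer_regulators(target_profiles, regulator_profiles, fit_error):
    """Repeat the ranking for every target gene (the outer loop)."""
    return {
        target: rank_regulators(profile, regulator_profiles, fit_error)
        for target, profile in target_profiles.items()
    }
```

Keeping the fit as a parameter means the same loop can be run with either the nonlinear or the linear model, which mirrors how the two are compared in the paper.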

Dataset selection

used eukaryotic cell cycle dataset (Spellman et al.) to evaluate model, which includes:
- gene expression changes at 18 time points over 2 cell cycle periods
- 6178 open reading frames on microarray chip
- identified 800 genes associated with cell cycle according to their expression
however, number of regulators controlling cell cycle < 800, therefore pool of 184 chosen by researchers for this experiment based on YEASTRACT database and previous papers
chose same 40 target genes from Chen et al. paper to compare data

Inference of regulators

data put into log base 2 of ratio (actual value of the mRNA over standard value that was the same for every time point)
prior to use of algorithm —> data squared/ least squares minimization applied to target genes for potential regulator
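
As a small illustration of the log base 2 ratio described above, the Python sketch below divides each mRNA measurement by a reference value that is the same for every time point. The numbers and the name `reference_level` are made up for the example.

```python
import math


def log2_ratios(mrna_levels, reference_level):
    """Convert raw mRNA levels to log2(level / reference)."""
    return [math.log2(level / reference_level) for level in mrna_levels]
```

On this scale a value of 1 means twice the reference level, 0 means equal to the reference, and -1 means half of it.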

Equation 8

estimates unknown real gene expression profile of target gene
approximation takes into account error due to experiment/ natural fluctuations by polynomial fit rather than statistical model, which could have been used

Equation 9

calculates the deviation from experimental data

most likely potential regulator determined by identifying the regulator profile that best models the target profile (eq. 4) and minimizes error (eq. 6)

therefore, chose regulators with an error (E) from eq. 6 less than or equal to the deviation (E1) computed using eq. 9

the smaller the value of E, the better the regulator fits the target gene profile

best regulators= those that have a recognizably smaller E than other regulators therefore fitting the target profile better

Table 1

complete list of 40 target genes with summary of correct identification, indicating which regulators for specific target genes were actually confirmed to be regulators for that gene in the YEASTRACT database (contains current, but incomplete knowledge)
- m= total number of regulators identified based on criteria
- min (m)= which number regulator the first correctly identified one was in the list of regulators for that target gene (listed from smallest to largest E value)
- min (m) lin= based on linear model instead of non linear
- E= defined by equation 6 (non linear model eq. 4 & linear model eq. 7)
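
The min (m) column can be illustrated with a short Python helper. It takes the regulators for one target gene, already listed from smallest to largest E, and returns the position of the first one that is confirmed in a reference set such as the YEASTRACT entries for that gene. The names used are placeholders, not real gene names.

```python
def first_confirmed_rank(ranked_regulators, confirmed):
    """Return the 1-based position of the first confirmed regulator, or None."""
    for position, name in enumerate(ranked_regulators, start=1):
        if name in confirmed:
            return position
    return None
```

For example, `first_confirmed_rank(["REG_A", "REG_B", "REG_C"], {"REG_B", "REG_C"})` returns 2, because the first confirmed regulator is second in the list.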

Key results drawn from Table 1

out of the best regulators —> 35% correctly identified

false positive (FP) rate, defined as the ratio of predicted regulators not found in the database to the total number of potential regulators = low, at 1.1%

when E < E1 used as criterion → 37.5% correctly identified, but the FP rate doubled

two other methods E < 1.1 * E1 & E < 1.2 * E1 —> found to increase correct identification percentage, but also increase FP rate
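
The threshold criteria (E < E1, E < 1.1 * E1, E < 1.2 * E1) can be expressed as one Python function with a multiplying factor. The strict inequality follows the wording used for these criteria, and the error values in the usage note are invented for illustration.

```python
def select_regulators(errors, e1, factor=1.0):
    """Keep regulators whose error E is below factor * E1, sorted by E."""
    kept = [(name, e) for name, e in errors.items() if e < factor * e1]
    return sorted(kept, key=lambda pair: pair[1])
```

With made-up errors `{"REG_A": 0.010, "REG_B": 0.021, "REG_C": 0.023}` and E1 = 0.02, the factors 1.0, 1.1 and 1.2 keep one, two and three regulators respectively. This shows why loosening the criterion can raise both the number of correct identifications and the number of false positives.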

4 target genes found to cause most of the increases in FP rate: YOR323C, YJL155C, YDR285W, YAL018C
- due to fact that almost any of the regulators could model their profiles
- in addition, these 4 profiles displayed high fluctuations, therefore having large E1 values (deviation from experimental data)
- because of the difficulty in evaluating them, data for these 4 genes were excluded

specificity of prediction (Sp = (N - FP)/N) —> indicates number of experiments necessary to verify results of algorithm
- N= total number of potential regulators
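
A worked example of the specificity formula in Python. The count of 2 false positives is an illustrative assumption, chosen because 1.1% of the 184 potential regulators is roughly 2. It is not a figure quoted from the paper.

```python
def specificity(n_potential, false_positives):
    """Sp = (N - FP) / N."""
    return (n_potential - false_positives) / n_potential
```

Here `specificity(184, 2)` is 182/184, or about 0.989, which matches 1 minus the reported FP rate of 1.1%.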

looking at min (m) —> most correctly identified regulators identified within first 5 regulators in list
- for 90% of target genes (36/40) correctly identified regulator identified within first 10 regulators in list
- all target genes had at least one correctly identified regulator identified within first 21 in list

Compared sign of weight of regulators (+ = activator, - = repressor) to YEASTRACT database
- 77.8% of regulators correctly identified as activators or repressors
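
The sign comparison can be sketched in Python as below: a positive weight is read as an activator, a negative weight as a repressor, and the predicted roles are then compared with documented roles. The dictionaries are hypothetical inputs, and treating a weight of exactly zero as undetermined is an assumption of this sketch.

```python
def classify_regulator(weight):
    """Map the sign of a fitted weight w to a regulatory role."""
    if weight > 0:
        return "activator"
    if weight < 0:
        return "repressor"
    return "undetermined"


def role_agreement(weights, documented_roles):
    """Fraction of regulators whose predicted role matches the documented one."""
    shared = [name for name in weights if name in documented_roles]
    if not shared:
        return None
    matches = sum(
        classify_regulator(weights[name]) == documented_roles[name]
        for name in shared
    )
    return matches / len(shared)
```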

Sources of inaccuracies/ error in prediction of function (activator vs repressor) vs actual function
- YEASTRACT database incomplete
- high error due to experimental noise
- least squares minimization (used to calculate target gene expression profile) may not have reached the optimal solution, meaning the actual regulator may not have been identified
- attempted to reduce this issue by running the least squares minimization several times with different initial parameter values and keeping the set that gave the best solution (repeated 100 times for the best estimate of the profile of each target gene)

Figure 1

displays expression profiles of target gene, best fitting regulator, reconstructed target gene for 12 genes that are regulated by the cell cycle
- A group: repressors
- B group: activators
- x-axis: 18 time points
- y-axis: expression relative to time point zero
- graph name lists target gene then regulator

repressors- opposite trend as target genes
activators- same trend as target gene

Comparison with linear model

compared results of non linear model to linear one (eq. 7) as shown in Table 1

regulators listed from smallest to largest value of E (eq. 6)
min (m)= which number regulator the first correctly identified one was in the list of regulators
- Figure 2 = histogram of distribution of these values
- A (left) non linear model
- B (right) linear model

non linear model seems to have better fit (one degree better) & better results

when error of measurement in microarray experiment set at 10%, meaning E value of regulator should be within 10% of E=0.0268
- 9 genes excluded for non linear model & 37 excluded for linear model

Discussion

used an algorithm based on a nonlinear model that models all possible regulator-target gene combos and determines which regulators best fit the gene expression profile of the specified target gene
true regulators determined by comparing results to YEASTRACT database
found that the nonlinear model correctly identifies the regulators of target genes associated with the cell cycle in yeast and correctly determines their function (activator or repressor)
compared to the linear model, the nonlinear model gives better results in terms of correct identification of the regulator and a better fit to the gene expression profile of the target.
- linear model gave the poorest fit and lowest prediction ability when compared to both the nonlinear model and the model presented in Chen et al. (generalized linear model)
when the 3 models were compared (nonlinear, linear, Chen et al.), all 3 gave different results for the sets of genes
shows how models are often created to focus on a specific case or aspect
- for example, the Chen et al. model was found to have an improved fit since it included more regulators at each step; however, this came at the cost of accuracy when predicting non-documented regulators for a target

the non linear model overall showed good accuracy and reasonable fit
- identified correct regulators and their functions

since the model captures the behavior of transcriptional regulation/ provides info on influence of possible regulators and correctly predicts regulators, it can act as a useful tool in interpreting gene expression time series

however, large scale network may require a large number of computations that may be unrealistic
- requiring improvements in speed and incorporation of additional independent genome-wide location data, DNA sequence data, etc.

Other drawbacks/ limitations still exist
- since possible regulators are chosen from larger pool of ones defined independently for example through sequence analysis for similarity, regulators may remain unidentified
- genes that aren't involved in regulation may have expression profiles that satisfy the minimization criterion therefore FPs from this would have to be sorted out independently
- the involvement of a regulator could be masked by inhibition of its activity by another regulator that may not be in the set of potential regulators, leading to a false negative
- proteins that mediate transcriptional regulation arise from post-translational modification, therefore use of proteomic data may be more appropriate instead of the microarray data

Acknowledgments

The sites linked next to each term were used to define the biological terms.
The equations, figures and key info pertaining to the construction of the model and execution of the experiment, such as definition of variables, were copied and pasted from the article referenced below.
Except for what is noted above, this individual journal entry was completed by me and not copied from another source.
- Nika Vafadari 05:24, 27 March 2017 (EDT):

References

Dahlquist, Kam D. (2017) BIOL398-05/S17:Week 10. Retrieved from http://www.openwetware.org/wiki/BIOL398-05/S17:Week_10 on 27 March 2017.
Vu, T. T., & Vohradsky, J. (2007). Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae. Nucleic acids research, 35(1), 279-287. doi: 10.1093/nar/gkl1001
