User:Nuri Purswani/Network/Introduction

From OpenWetWare
Jump to navigationJump to search
Algorithms for Biological Network Reconstruction from data

Home Introduction Synthetic Datasets Methods Results Discussion References
Introduction

Motivation

Inferring gene-regulatory network topologies from microarray expression analysis is a challenging task. Commonly used algorithms include Bayesian Methods, Information theory and ODEs (Bansal et al. 2006, 2007; Stumpf et al. 2009, 2010; Sontag et al. 2004, 2005). These are classed as “reverse engineering” methods, and theoretically prove to be effective in biological datasets. However, current network reconstruction algorithms have two major limitations:

  1. At present experimental data is not sufficient to account for elements of variation, including: hidden states, noisy data, nonlinear dynamics… hence in many cases, algorithms yield poor results (Bansal et al. 2006).
  2. The task becomes harder as the number of nodes in the network increases (Bansal et al. 2006).

In this project, we compare an algorithm developed by Dr Stan in collaboration with Dr Gonçalves (Stan et al. 2009, 2010, Gonçalves et al. 2008,2009) to the method of Rangel and Beal (Rangel et al. 2004, Beal et al. 2005). The robust control approach proposes that to fully reconstruct a network, when nothing is known prior to the start of the experiment:

  • If there are p chemical species present in the network, we must perform p measurements (Stan,Gonçalves et al. 2008-2010).
  • Each input in the experiment must independently control a node/measured species.

As a consequence, in the absence of sufficient data the algorithm can only provide a “coarse” representation of the network. In this case, conclusions on network topology are only drawn from measured states, independently of the number of hidden (unobservable) states. If more data becomes available, the structure can be refined (Stan,Gonçalves et al.). This approach is crucial for obtaining real understanding of network topology. Without experimentation, every network structure is possible and this can end up yielding meaningless results. Furthermore, existing results in literature do not exploit the understanding of necessary experiment design for network reconstruction. This method can be extended beyond the in silico implementation to become a debugging tool for synthetic biology.

Main Types of Network Inference Algorithms

Most published network inference algorithms fall into four main categories:

  • Clustering methods: These look for patterns of co-expression between genes, group these accordingly. They are limited in their inability to infer precise gene-gene interactions, and would not be suitable standalone methods when precision in network inference is required.
  • Bayesian Networks: These commonly perform better than clustering algorithms at describing processes of locally interacting components, such as genes. The statistical foundations of bayesian methods are well understood, and they are excellent for describing causal relationships. The main focus of this project will be to compare bayesian methods to a robust control theory approach. Therefore, this type of network will be revisited.
  • Information Theory: These methods compute measures of mutual information between genes to assess possible interactions according to statistical significance. Their main drawback is that they require large datasets in order to be able to assess statistical significance, which may not be ideal in cases where we seek a refined view of the structure of our network.
  • ODEs: These require a constraint on the structure of biological information. They are not ideal when we wish to compute network structure de-novo.

While most methods fall into these four general classes, this project focuses on a subset of algorithms that can cope with "unmeasured" or "hidden states". The next subsection focuses on these in more detail.

Network Inference with Hidden Variables

Hidden variables in gene-gene interaction networks are quantities that cannot be directly measured from a gene expression profiling experiment. These commonly refer to:

  • Genes not included in the biological experiment
  • Other proteins (e.g. regulatory) that cannot be measured directly
  • Effects of mRNA and protein degradation...

Most methods reviewed in the previous section lack the ability to cope with unobserved quantities. In this project, we focused on comparing a subset of methods that can cope with these "hidden variables". In particular: Rangel et al. 2004, Beal et al. 2005 and Gonçalves and Warnick 2008. Rangel and Beal focused on linear dynamical systems (Rangel et al. 2004, Beal et al. 2005) to model gene expression data. They then applied a bayesian inference approach to recover network structure from time series , employing slightly different procedures:

  • Rangel's approach focuses on a "classical approach" of parameter inference using log likelihood methods, kalman smoothing and the expectation maximization algorithm. It required many repeats of the datasets and from there, it would start a bootstrapping procedure to infer parameters from the given data.
  • Beal's approach has the same assumptions, but instead, he employs a variational "treatment" where the aim is to obtain a "posterior" distribution" of parameters, rather than a point estimate. See methods for a more extensive discussion on this.

Gonçalves,Warnick and Stan employ a robust control theory approach to network inference, where they are able to recover boolean network structure from steady state gene expression measurements. Results in literature have previously shown that Beal's method is superior to Rangel's in the inference of network parameters (Beal et al. 2007), therefore, after having contacted the authors, Beal's method was chosen over Rangel's for the direct comparison between Gonçalves and Warnick's. We use two naming conventions:

  1. The method of Gonçalves et al. is referred to as the robust control method
  2. The method of Beal et al. is referred to as the bayesian inference method

To understand these two algorithms in more detail, the Methods section contains a more in-depth explanation of how these work. Including a description of the inputs and outputs that the algorithms can take and the mathematical assumptions. In addition, we have provided a Literature Review of the methods that are interesting for comparison. We did not get to implement all of them in this project.

Project Aims

The aims of this project are to compare the relative algorithmic performance of the Bayesian inference and the Robust control approaches. The criteria we utilised to asess algorithmic performance included:

  • Ability to recover the correct network structure as a function of signal to noise ratio.
  • Ability to recover the correct network structure in the presence of non-linearities.
  • Ability to recover sub-topologies in complex networks with feedback loops.
  • Ability to recover network structure as a function of the input perturbation.
  • Noise tolerance of both methods as a function of increasing number of experimental repeats.
  • Comparison of the types of inputs that each algorithm requires.

The comparison should serve as a guide to the user, who may wish to select a suitable method to analyse his/her data. The method by Gonçalves and Warnick requires the datasets to be in a particular input format (see Methods) that is not available experimentally. For this reason, we have based the comparisons on in silico implementations although future plans of the group are to test their algorithm on the DREAM datasets, so it is still work in progress.