Review of Existing Network Inference Algorithms
Biological network inference algorithms aim to recover an underlying network topology from available data. In the context of gene expression, the data commonly consists of time series measurements from microarray experiments. Most published network inference algorithms fall into four main categories. To learn about each in detail, follow this LINK TO A PROPER LITERATURE REVIEW WITH 10 METHODS!. Each type of method has its relative merits and limitations, as summarised below.
- Clustering methods: These look for patterns of co-expression between genes, group these accordingly. They are limited in their inability to infer precise gene-gene interactions, and would not be suitable standalone methods when precision in network inference is required.
- Bayesian Networks: These commonly perform better than clustering algorithms at describing processes of locally interacting components, such as genes. The statistical foundations of bayesian methods are well understood, and they are excellent for describing causal relationships. The main focus of this project will be to compare bayesian methods to a robust control theory approach. Therefore, this type of network will be revisited.
- Information Theory: These methods compute measures of mutual information between genes to assess possible interactions according to statistical significance. Their main drawback is that they require large datasets in order to be able to assess statistical significance, which may not be ideal in cases where we seek a refined view of the structure of our network.
- ODEs: These require a constraint on the structure of biological information. They are not ideal when we wish to compute network structure de-novo.
While most methods fall into a subset of these four general classes, this project focuses on a subset of algorithms that can cope with "unmeasured" or "hidden states". The next section focuses on these in more detail.
Network Inference with Hidden Variables
Hidden variables in gene-gene interaction networks are quantities that cannot be directly measured from a gene expression profiling experiment. These commonly refer to:
- genes not included in the microarray
- regulatory proteins that have not been measured
- mRNA and protein degradation
Most methods reviewed in the previous section lack the ability to cope with unobserved quantities. In this project, we focused on comparing three methods:
Rangel et al. 2004, Beal et al. 2005 and Goncalves and Warnick 2008.
Rangel and Beal focused on linear dynamical systems (ref) to model gene expression data. They then applied a bayesian inference approach to recover network structure from time series data, employing slightly different procedures.
- Rangel's approach focuses on a "classical approach" of parameter inference using log likelihood methods, kalman smoothing and the expectation maximization algorithm.
- Beal's approach has the same assumptions, but instead, employs a variational "treatment" where the aim is to obtain a "posterior" distribution of parameters, rather than a point estimate.
Goncalves and Warnick employ a robust control theory approach to network inference, where they are able to recover boolean network structure from steady state gene expression measurements.
To understand these in more detail, the methods section contains an explanation of how these work. Including a description of the inputs and outputs that the algorithms can take and the mathematical assumptions.
The aims of this project are to compare the relative algorithmic performance of the aforementioned methods. Here we focus on comparing the algorithmic performance on the basis of:
- Recovery of correct network structure in the presence of noise
- Recovery of correct network structure in the presence of non-linearities
- Recovery of correct network structure as a function of number of experiments
- Recovery of correct network structure as a function of the number of time points
- Recovery of correct network structure with real datasets
The comparison should serve as guide for the user who wishes to select a suitable method to analyse data. The method by Goncalves and Warnick requires the datasets to be in a particular input format (explained in methods), therefore, the comparison in this project was performed using synthetic datasets generated from ODE models.