IGEM:IMPERIAL/2008/Modelling/Tutorial2

=Pre-Analysis Data Processing=

The outcome of the motility assay is a movie, in the form of a sequence of frames. A bacteria-tracking algorithm can be applied to these frames. For the moment we will not question the output of the tracking algorithm: the times and positions it yields are considered exact.

In reality, limitations on the tracking of motile B. subtilis include:
 * Determining the centre of the organism as it moves in order to define a co-ordinate for the position,
 * Frame-rate of video capture - can we capture all the relevant movements the organism makes?
 * Resolution of the images - are the images high enough resolution to allow us to accurately resolve the location of the organism?
 * Errors in the algorithm analysing the images

In order to analyse the data, they must be broken into manageable subsets:
 * Design a routine to convert the trajectory of a bacterium into the relevant times, angle, velocities etc…
 * Apply your routine to your synthetic dataset(s)
 * Crude Error Checking: Plot the relevant histograms and compare them to histograms of the synthetic dataset(s) generated in Tutorial 1 (they should be the same)
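A minimal sketch of such a routine, in Python with NumPy. It assumes (this is not specified above) that the trajectory has already been reduced to its turning points, given as arrays `t`, `x`, `y`; the function name `decompose_trajectory` is hypothetical. Frame-by-frame tracker output would first need a turn-detection step, which is not shown here.

```python
import numpy as np

def decompose_trajectory(t, x, y):
    """Split a piecewise-linear trajectory into runs.

    t, x, y: time and position at each turning point (assumed input format).
    Returns run durations, run speeds, and the turning angle between
    consecutive runs, wrapped to [-pi, pi).
    """
    t, x, y = (np.asarray(a, dtype=float) for a in (t, x, y))
    durations = np.diff(t)                      # time spent on each run
    dx, dy = np.diff(x), np.diff(y)             # displacement of each run
    speeds = np.hypot(dx, dy) / durations       # constant speed per run
    headings = np.arctan2(dy, dx)               # direction of each run
    # Difference of headings, wrapped so that a turn is never more than pi
    turns = (np.diff(headings) + np.pi) % (2 * np.pi) - np.pi
    return durations, speeds, turns

# Two runs: 1 unit along x in 1 s, then 2 units along y in 2 s
durations, speeds, turns = decompose_trajectory([0, 1, 3], [0, 1, 1], [0, 0, 2])
```

Histograms of `durations`, `speeds` and `turns` can then be compared with those of the synthetic dataset.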

With the kind of motion mentioned in Tutorial 1, the trajectory can be split into four sets of independent data (time, velocity, angle, turning time). However, we can expect velocity and time to be linked.

How can we expect velocity and time to be linked if we are assuming that the velocity is uniform during a run and normally distributed from run to run? Are we assuming that the closer the run time is to the centre of its distribution, the closer the velocity will be to the centre of its normal distribution?

The turning time is not really recorded in the output from our model.


 * In the likely event they are indeed linked, modify the pre-analysis data processing – and the error checking – accordingly.
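One simple way to check for a link is the Pearson correlation coefficient between run durations and run speeds. The sketch below assumes both quantities are available as equal-length arrays; the function name `run_correlation` is hypothetical.

```python
import numpy as np

def run_correlation(durations, speeds):
    """Pearson correlation coefficient between run durations and run speeds."""
    return float(np.corrcoef(durations, speeds)[0, 1])

# Speeds exactly proportional to durations: correlation close to 1
r_linked = run_correlation([1, 2, 3, 4], [2, 4, 6, 8])
# Speeds with no linear dependence on durations: correlation close to 0
r_unlinked = run_correlation([1, 2, 3, 4], [3, 1, 1, 3])
```

A coefficient near zero only rules out a linear link; a scatter plot of speed against duration is a useful complement.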

=Parameter Estimation=

In mathematical modelling, hypotheses about the process of interest (in this case, the motility of B. subtilis) are stated as parametric families of probability distributions, called models (see Tutorial 1).

The goal of modelling is to make discoveries about the underlying process, by testing these hypotheses.

Once a model, together with its parameters, has been specified, and data have been gathered, the model can be evaluated.

The parameters of a statistical distribution can be estimated from a set of samples in many different ways. The most common way is to apply an estimator to the samples in order to extract the parameters of the distribution.

Maximum Likelihood
The concept of likelihood is related to the more familiar concept of probability. The best explanation that I found online is here.

Maximum likelihood estimators of some standard distributions are given below.
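For reference, the standard maximum likelihood estimators for the Gaussian, exponential and Poisson distributions are sketched here. The choice of these three distributions is an assumption about which ones are relevant, and the notation $$x_1 \dots x_N$$ for the N samples is introduced only for this illustration.

```latex
% Gaussian with mean \mu and variance \sigma^2
\[
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2
\]

% Exponential with density \lambda e^{-\lambda x}
\[
\hat{\lambda} = \frac{N}{\sum_{i=1}^{N} x_i}
\]

% Poisson with mean \lambda
\[
\hat{\lambda} = \frac{1}{N}\sum_{i=1}^{N} x_i
\]
```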



The experimental data we obtain do not give us access to the entire underlying distribution. We hope that the data are representative of the underlying distribution. The amount of data we use to estimate the parameters is therefore a crucial factor in the accuracy of the outcome. By applying the relevant estimator to the synthetic dataset we generated, we can see that increasing the size of the data set increases the accuracy with which we can estimate the parameter. The order of the samples in the data set does not influence the likelihood. Advantages and disadvantages of the MLE approach to parameter estimation are summarized here.
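The effect of sample size can be tried out on synthetic Gaussian data, as in the sketch below. The true mean and standard deviation, the seed, and the helper name `mle_gaussian` are illustrative assumptions, not values from Tutorial 1.

```python
import numpy as np

def mle_gaussian(samples):
    """Maximum likelihood estimates (mean, variance) for a Gaussian."""
    x = np.asarray(samples, dtype=float)
    mean = x.mean()
    variance = np.mean((x - mean) ** 2)   # 1/N normalisation, not 1/(N-1)
    return mean, variance

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
true_mean, true_sigma = 1.0, 0.5          # illustrative values only
data = rng.normal(true_mean, true_sigma, size=100_000)

# Estimates from increasingly large subsets of the same data
for n in (10, 100, 1_000, 100_000):
    mean_hat, var_hat = mle_gaussian(data[:n])
    print(n, mean_hat, var_hat)

# Reordering the samples leaves the estimates unchanged
shuffled = rng.permutation(data)
```

The error tends to shrink roughly as one over the square root of the number of samples, although any single run can fluctuate.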

Moments
Another possible method is to recombine the moments of the distribution in such a way that the outcome is a parameter of the distribution. The nth (raw) moment of a distribution is defined by $$m_n=\mu_n'=\langle x^n \rangle$$. Take a look at this site for detailed explanations. The first raw moment (n=1) is the mean $$\mu$$ of the distribution. Centred moments are taken with respect to the mean, $$\mu_n=\langle (x-\mu)^n \rangle$$, and describe the shape of the distribution relative to its average. This is convenient for common distributions such as the Gaussian and Maxwell-Boltzmann distributions, among many others. Take a look at this site for more details on central moments. The centred moment of 2nd order is the variance; the 3rd and 4th order centred moments, once divided by $$\sigma^3$$ and $$\sigma^4$$ respectively, give the skewness and the kurtosis of the distribution. This site gives an outline on how to derive moments using the Moment-Generating Function.
 * What is the definition of $$m_n$$ (the nth moment of a distribution)?
 * Often we prefer centred moments for n>1. What is their definition?
 * What do we call the centred moments of second, third and fourth order?
 * What do they measure?
 * 1) Variance is given by: $$\sigma^2=\int P(x)(x-\mu)^2 dx$$. It gives a measure of statistical dispersion (degree of being spread out), averaging the squared distance of its possible values from the mean.
 * 2) Skewness is the third standardized moment and is defined as: $$\gamma_1 = \frac{\mu_3}{\sigma^3}$$, where $$\mu_3$$ is the third centred moment. It is a measure of the degree of asymmetry of a distribution. If the left tail is more pronounced (elongated) than the right tail, the function is said to have negative skewness. If the reverse is true, it has positive skewness. If the two are equal, it has zero skewness. This site gives a table of skewness for common distributions.
 * 3) Kurtosis is the fourth standardized moment and is defined as: $$\frac{\mu_4}{\sigma^4}$$, where $$\mu_4$$ is the fourth centred moment. It is often described as the degree of peakedness of a distribution. A high kurtosis distribution has a sharper "peak" and fatter "tails", while a low kurtosis distribution has a more rounded peak with wider "shoulders". Positive and negative kurtosis refer to the excess kurtosis $$\frac{\mu_4}{\sigma^4}-3$$, which is zero for a Gaussian. This site provides good examples of various distributions with positive and negative kurtosis.
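The definitions above translate directly into sample estimates. The sketch below uses the plain 1/N averages, without bias corrections; the function name `sample_moments` is hypothetical.

```python
import numpy as np

def sample_moments(samples):
    """Return (mean, variance, skewness, kurtosis) of a set of samples.

    Variance is the second centred moment; skewness and kurtosis are the
    third and fourth centred moments divided by sigma^3 and sigma^4.
    """
    x = np.asarray(samples, dtype=float)
    mean = x.mean()
    centred = x - mean
    variance = np.mean(centred ** 2)
    sigma = np.sqrt(variance)
    skewness = np.mean(centred ** 3) / sigma ** 3
    kurtosis = np.mean(centred ** 4) / sigma ** 4   # not the excess kurtosis
    return mean, variance, skewness, kurtosis

mean, variance, skewness, kurtosis = sample_moments([1, 2, 3, 4, 5])
```

For this small symmetric data set the skewness is zero, as expected from the definition.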

Recombining the moments $$m_1 \dots m_n$$ in order to estimate the parameters $$\alpha_1\dots\alpha_p$$ consists of identifying the functions $$g_1 \dots g_p$$ such that $$\alpha_i = g_i(m_1, \dots, m_n)$$ for $$i = 1 \dots p$$.
 * Apply the method to the Gaussian, Exponential, Poisson distributions…
 * The functions $$g_1 \dots g_p$$ are not unique - see the case of the exponential distribution. How do you think the discrepancies between the various formulae can be exploited?
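As a worked illustration, here are possible choices of these functions for the three distributions, written in terms of the raw moments $$m_1$$ and $$m_2$$. The parametrisations (rate $$\lambda$$ for the exponential, mean $$\lambda$$ for the Poisson) are assumptions made for this sketch.

```latex
% Gaussian: m_1 = \mu, \; m_2 = \mu^2 + \sigma^2
\[
\mu = m_1, \qquad \sigma^2 = m_2 - m_1^2
\]

% Exponential with density \lambda e^{-\lambda x}: m_1 = 1/\lambda, \; m_2 = 2/\lambda^2
\[
\lambda = \frac{1}{m_1} \qquad \text{or} \qquad \lambda = \sqrt{\frac{2}{m_2}}
\]

% Poisson with mean \lambda: m_1 = \lambda, \; m_2 = \lambda + \lambda^2
\[
\lambda = m_1 \qquad \text{or} \qquad \lambda = m_2 - m_1^2
\]
```

When the moments are exact, the alternative formulae for a given parameter coincide.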

Again we never have access to the entire distributions, only to data that we think are representative of the distribution. The amount of data we use to estimate the parameters is again crucial for the accuracy of the outcome.
 * Calculate the first four moments with various amounts of data
 * For instance use 10 samples, 50 samples, 100 samples
 * Does the order of your data have a significant influence?
 * Apply the recombination to the computed sample-moments of distributions of your choice
 * Compare the accuracy of the approach to the results obtained with ML.
 * Discuss the pros and cons of the method
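The comparison can be sketched for the exponential distribution, where the rate can be recovered either from the first or from the second raw moment. The true rate, the seed and the function names are illustrative assumptions. For this distribution the first-moment formula is also the maximum likelihood estimator.

```python
import numpy as np

def rate_from_m1(samples):
    """Rate estimate from the first raw moment: lambda = 1 / m_1."""
    return 1.0 / np.mean(samples)

def rate_from_m2(samples):
    """Rate estimate from the second raw moment: lambda = sqrt(2 / m_2)."""
    return np.sqrt(2.0 / np.mean(np.square(samples)))

rng = np.random.default_rng(1)            # fixed seed so the run is repeatable
true_rate = 2.0                           # illustrative value only
data = rng.exponential(scale=1.0 / true_rate, size=100_000)

# Compare the two formulae on increasingly large subsets
for n in (10, 50, 100, 100_000):
    subset = data[:n]
    print(n, rate_from_m1(subset), rate_from_m2(subset))
```

With few samples the two estimates can differ noticeably; they approach each other and the true rate as the sample grows.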

=Bayesian Analysis=

Bayes Theorem
Bayes' theorem states that $$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$ where $$P(A|B)$$ is termed the posterior probability, $$P(A)$$ is the prior probability, $$P(B|A)$$ is the likelihood and $$P(B)$$ is the marginal probability of B (the evidence).
 * What is Bayes’ update rule? What is its meaning?
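One common reading of the update rule is that the posterior obtained after one observation serves as the prior for the next. A minimal sketch over a discrete set of hypotheses follows; the function name `bayes_update` and the numbers are illustrative assumptions.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Posterior over discrete hypotheses: prior * likelihood, renormalised."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    unnormalised = prior * likelihood
    # The normalising sum plays the role of P(B) in Bayes' theorem
    return unnormalised / unnormalised.sum()

# Two hypotheses, equally likely a priori; the observation favours the first
posterior = bayes_update([0.5, 0.5], [0.8, 0.2])
# A second, identical observation: the previous posterior is the new prior
posterior2 = bayes_update(posterior, [0.8, 0.2])
```

Each observation that favours the first hypothesis shifts more probability towards it.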

=References=
 * 1) Rice Maximum likelihood estimation of Rician distribution parameters (1998) Jan Sijbers, Arnold J. Den Dekker, Paul Scheunders, Dirk Van Dyck. IEEE Transactions on Medical Imaging