User:Timothee Flutre/Notebook/Postdoc/2012/01/02

Learn about the multivariate Normal and matrix calculus

(Caution: this is my own quick-and-dirty tutorial; see the references at the end for presentations by professional statisticians.)

  • Motivation: when we measure items, we often have to measure several properties for each item. For instance, we extract a sample of cells from each individual in our study, and we measure the expression level of all genes in the sample. We hence have, for each individual, a vector of measurements (one per gene), which leads us to the world of multivariate statistics.
  • Data: we have N observations, denoted [math]\displaystyle{ X = (x_1, x_2, ..., x_N) }[/math], each being of dimension [math]\displaystyle{ P }[/math]. This means that each [math]\displaystyle{ x_i }[/math] is a vector belonging to [math]\displaystyle{ \mathbb{R}^P }[/math].
  • Model: we suppose that the [math]\displaystyle{ x_i }[/math] are independent and identically distributed according to a multivariate Normal distribution [math]\displaystyle{ N_P(\mu, \Sigma) }[/math]. [math]\displaystyle{ \mu }[/math] is the P-dimensional mean vector, and [math]\displaystyle{ \Sigma }[/math] the PxP covariance matrix. If [math]\displaystyle{ \Sigma }[/math] is positive definite (which we will assume), the density function for a given x is: [math]\displaystyle{ f(x|\mu,\Sigma) = (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)) }[/math], with [math]\displaystyle{ |M| }[/math] denoting the determinant of a matrix and [math]\displaystyle{ M^T }[/math] its transpose.
  • Likelihood: as usual, we will start by writing down the likelihood of the data, the parameters being [math]\displaystyle{ \theta=(\mu,\Sigma) }[/math]:

[math]\displaystyle{ L(\theta) = f(X|\theta) }[/math]

As the observations are independent:

[math]\displaystyle{ L(\theta) = \prod_{i=1}^N f(x_i | \theta) }[/math]

It is easier to work with the log-likelihood:

[math]\displaystyle{ l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i | \theta) ) }[/math]

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math]
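
As a quick sanity check, here is a minimal Python sketch (assuming NumPy and SciPy are available; none of this appears in the derivation itself) that evaluates the log-likelihood exactly as written above and compares it with the sum of log-densities returned by scipy.stats.multivariate_normal:

<pre>
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, P = 100, 3

# arbitrary parameter values (Sigma built to be positive definite)
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(P, P))
Sigma = A @ A.T + P * np.eye(P)

# simulate N observations of dimension P
X = rng.multivariate_normal(mu, Sigma, size=N)

# log-likelihood written exactly as in the formula above
diff = X - mu
quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
loglik = (- N * P / 2 * np.log(2 * np.pi)
          - N / 2 * np.log(np.linalg.det(Sigma))
          - 0.5 * quad.sum())

# same quantity computed by SciPy
loglik_scipy = multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()

print(np.allclose(loglik, loglik_scipy))  # expected: True
</pre>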

  • ML estimation: as usual, to find the maximum-likelihood estimates of the parameters, we need to differentiate the log-likelihood with respect to each parameter, and then find the values of the parameters at which the derivatives are zero. However, in the case of multivariate distributions, this requires knowing a bit of matrix calculus, which is not always straightforward...
  • Matrix calculus: some useful formulas

[math]\displaystyle{ d(f(u)) = f'(u) du }[/math], eg. useful here: [math]\displaystyle{ d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|) }[/math]

[math]\displaystyle{ d(|U|) = |U| tr(U^{-1} dU) }[/math]

[math]\displaystyle{ d(U^{-1}) = - U^{-1} (dU) U^{-1} }[/math]
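
These identities are easy to check numerically; below is a small sketch (assuming NumPy) that compares each differential with a central finite difference along a random perturbation direction:

<pre>
import numpy as np

rng = np.random.default_rng(1)
P = 4
A = rng.normal(size=(P, P))
U = A @ A.T + P * np.eye(P)        # an invertible (positive-definite) matrix
dU = rng.normal(size=(P, P))       # an arbitrary direction of perturbation
t = 1e-6                           # step size for the finite differences

# d(|U|) = |U| tr(U^{-1} dU): compare with a central finite difference
num_ddet = (np.linalg.det(U + t * dU) - np.linalg.det(U - t * dU)) / (2 * t)
ana_ddet = np.linalg.det(U) * np.trace(np.linalg.inv(U) @ dU)
print(np.isclose(num_ddet, ana_ddet, rtol=1e-4))  # expected: True

# d(U^{-1}) = -U^{-1} (dU) U^{-1}: same idea, entry by entry
num_dinv = (np.linalg.inv(U + t * dU) - np.linalg.inv(U - t * dU)) / (2 * t)
ana_dinv = -np.linalg.inv(U) @ dU @ np.linalg.inv(U)
print(np.allclose(num_dinv, ana_dinv, rtol=1e-4, atol=1e-8))  # expected: True
</pre>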

  • Technical details: from Magnus and Neudecker (third edition, Part Six, Chapter 15, Section 3, p.353). First, they re-write the log-likelihood, noting that [math]\displaystyle{ (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math] is a scalar, ie. a 1x1 matrix, and is therefore equal to its trace:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) }[/math]

As the trace is invariant under cyclic permutations ([math]\displaystyle{ tr(ABC) = tr(BCA) = tr(CAB) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

The trace is also a linear map ([math]\displaystyle{ tr(A+B) = tr(A) + tr(B) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

And finally:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T ) }[/math]

As a result:

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} tr(\Sigma^{-1} Z) }[/math] with [math]\displaystyle{ Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T }[/math]
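
This rewriting with the trace is also easy to verify numerically; a short sketch (assuming NumPy) comparing the sum of quadratic forms with [math]\displaystyle{ tr(\Sigma^{-1} Z) }[/math]:

<pre>
import numpy as np

rng = np.random.default_rng(2)
N, P = 50, 3
mu = rng.normal(size=P)
A = rng.normal(size=(P, P))
Sigma = A @ A.T + P * np.eye(P)
X = rng.multivariate_normal(mu, Sigma, size=N)

diff = X - mu
Z = diff.T @ diff                   # Z = sum_i (x_i - mu)(x_i - mu)^T
lhs = np.einsum('ij,jk,ik->', diff, np.linalg.inv(Sigma), diff)  # sum of quadratic forms
rhs = np.trace(np.linalg.inv(Sigma) @ Z)
print(np.isclose(lhs, rhs))  # expected: True
</pre>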

We can now write the first differential of the log-likelihood:

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} d(ln(|\Sigma|)) - \frac{1}{2} d(tr(\Sigma^{-1} Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} |\Sigma|^{-1} d(|\Sigma|) - \frac{1}{2} tr(d(\Sigma^{-1}Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) - \frac{1}{2} tr(d(\Sigma^{-1})Z) - \frac{1}{2} tr(\Sigma^{-1} dZ) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr(\Sigma^{-1} (\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T)) }[/math]

At this step in the book, I don't understand how we go from the line above to the line below:

[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr\left( (d\Sigma)\Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} \right) + (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math]

[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr\left( (d\Sigma)\Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} \right) + N (d\mu)^T \Sigma^{-1} (\bar{x} - \mu) }[/math]
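
One way to see this step (a sketch of a possible justification, not the book's own derivation, so to be taken with caution): for the [math]\displaystyle{ d\Sigma }[/math] part, use the invariance of the trace under cyclic permutations together with [math]\displaystyle{ \Sigma^{-1} = \Sigma^{-1} \Sigma \Sigma^{-1} }[/math]:

[math]\displaystyle{ - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) = - \frac{1}{2} tr\left( (d\Sigma) \Sigma^{-1} (N\Sigma) \Sigma^{-1} \right) + \frac{1}{2} tr\left( (d\Sigma) \Sigma^{-1} Z \Sigma^{-1} \right) = \frac{1}{2} tr\left( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} \right) }[/math]

For the [math]\displaystyle{ d\mu }[/math] part, each summand is a scalar, hence equal to its own trace, and [math]\displaystyle{ \Sigma^{-1} }[/math] is symmetric, so:

[math]\displaystyle{ tr\left( \Sigma^{-1} (x_i - \mu) (d\mu)^T \right) = (d\mu)^T \Sigma^{-1} (x_i - \mu) = tr\left( \Sigma^{-1} (d\mu) (x_i - \mu)^T \right) }[/math]

Summing these two (equal) terms over [math]\displaystyle{ i }[/math], the factor [math]\displaystyle{ \frac{1}{2} }[/math] cancels and we obtain [math]\displaystyle{ (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math], which is the last term of the first of the two lines above; the second line then follows from [math]\displaystyle{ \sum_{i=1}^N (x_i - \mu) = N (\bar{x} - \mu) }[/math].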

The first-order conditions (ie. when [math]\displaystyle{ d l(\theta) = 0 }[/math]) are:

[math]\displaystyle{ \hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0 }[/math] and [math]\displaystyle{ \hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0 }[/math]

From which follow:

[math]\displaystyle{ \hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math]

and:

[math]\displaystyle{ \hat{\Sigma} = \bar{S}_N = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T }[/math]

Note that [math]\displaystyle{ \bar{S}_N }[/math] is a biased estimate of [math]\displaystyle{ \Sigma }[/math]. It is usually better to use the unbiased estimate [math]\displaystyle{ \bar{S}_{N-1} }[/math].
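
To see the bias in practice, here is a small simulation sketch (assuming NumPy; the numbers are arbitrary): averaged over many replicates, [math]\displaystyle{ \bar{S}_N }[/math] is close to [math]\displaystyle{ \frac{N-1}{N} \Sigma }[/math] whereas [math]\displaystyle{ \bar{S}_{N-1} }[/math] is close to [math]\displaystyle{ \Sigma }[/math]:

<pre>
import numpy as np

rng = np.random.default_rng(3)
N, P, reps = 10, 2, 20000
mu = np.zeros(P)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

sum_S_N = np.zeros((P, P))
sum_S_Nm1 = np.zeros((P, P))
for _ in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=N)
    xbar = X.mean(axis=0)            # MLE of mu
    D = X - xbar
    sum_S_N += D.T @ D / N           # biased MLE of Sigma
    sum_S_Nm1 += D.T @ D / (N - 1)   # unbiased estimate

print(sum_S_N / reps)    # close to (N-1)/N * Sigma = 0.9 * Sigma
print(sum_S_Nm1 / reps)  # close to Sigma
</pre>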

  • Sufficient statistics:

With [math]\displaystyle{ Z }[/math] defined above and [math]\displaystyle{ \bar{S}_{N-1} }[/math] defined like [math]\displaystyle{ \bar{S}_N }[/math] but with [math]\displaystyle{ N-1 }[/math] in the denominator, we can write the following:

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T }[/math]

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i - \bar{x} + \bar{x} - \mu)(x_i - \bar{x} + \bar{x} - \mu)^T }[/math]

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T + \sum_{i=1}^N (\bar{x} - \mu)(\bar{x} - \mu)^T + \sum_{i=1}^N (x_i - \bar{x})(\bar{x} - \mu)^T + \sum_{i=1}^N (\bar{x} - \mu)(x_i - \bar{x})^T }[/math]

The two cross terms vanish because [math]\displaystyle{ \sum_{i=1}^N (x_i - \bar{x}) = 0 }[/math], which leaves:

[math]\displaystyle{ Z = (N-1) \bar{S}_{N-1} + N (\bar{x} - \mu)(\bar{x} - \mu)^T }[/math]
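
This decomposition is a purely algebraic identity, so it can be checked on any data set; a short sketch (assuming NumPy):

<pre>
import numpy as np

rng = np.random.default_rng(4)
N, P = 30, 3
mu = rng.normal(size=P)          # any reference vector
X = rng.normal(size=(N, P))      # any data set works for this identity

xbar = X.mean(axis=0)
Z = (X - mu).T @ (X - mu)
S_Nm1 = (X - xbar).T @ (X - xbar) / (N - 1)
rhs = (N - 1) * S_Nm1 + N * np.outer(xbar - mu, xbar - mu)
print(np.allclose(Z, rhs))  # expected: True
</pre>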

Thus, by employing the same trick with the trace as above, we now have:

[math]\displaystyle{ L(\mu, \Sigma) = (2 \pi)^{-NP/2} |\Sigma|^{-N/2} exp \left( -\frac{N}{2}(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) -\frac{N-1}{2}tr(\Sigma^{-1}\bar{S}_{N-1}) \right) }[/math]
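
To make sure nothing was lost in the rewriting, here is a sketch (assuming NumPy and SciPy) that checks the log of this expression against the direct sum of the individual log-densities:

<pre>
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
N, P = 25, 3
mu = rng.normal(size=P)
A = rng.normal(size=(P, P))
Sigma = A @ A.T + P * np.eye(P)
X = rng.multivariate_normal(mu, Sigma, size=N)

xbar = X.mean(axis=0)
S_Nm1 = (X - xbar).T @ (X - xbar) / (N - 1)
Sinv = np.linalg.inv(Sigma)

# log of the likelihood written with the sufficient statistic
loglik_suff = (- N * P / 2 * np.log(2 * np.pi)
               - N / 2 * np.log(np.linalg.det(Sigma))
               - N / 2 * (xbar - mu) @ Sinv @ (xbar - mu)
               - (N - 1) / 2 * np.trace(Sinv @ S_Nm1))

# direct evaluation, observation by observation
loglik_direct = multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()

print(np.allclose(loglik_suff, loglik_direct))  # expected: True
</pre>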

The likelihood depends on the samples only through the pair [math]\displaystyle{ (\bar{x}, \bar{S}_{N-1}) }[/math]. Thanks to the Factorization theorem, we can say that this pair of values is a sufficient statistic for [math]\displaystyle{ (\mu, \Sigma) }[/math].

We can also transform the likelihood formula a bit more in order to find the distribution of this sufficient statistic:

[math]\displaystyle{ L(\mu, \Sigma) = (2 \pi)^{-(N-1)P/2} (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp \left( -\frac{1}{2}(\bar{x} - \mu)^T(\frac{1}{N}\Sigma)^{-1}(\bar{x} - \mu) \right) \times |\Sigma|^{-(N-1)/2} exp \left(-\frac{N-1}{2}tr(\Sigma^{-1}\bar{S}_{N-1}) \right) }[/math]

[math]\displaystyle{ L(\mu, \Sigma) \propto N_P(\bar{x}; \mu, \Sigma/N) \times W_P(\bar{S}_{N-1}; \Sigma, N-1) }[/math]

The likelihood is only proportional to this product because the first constant does not appear in either of the two distributions and a few constants are missing (eg. the Gamma function appearing in the density of the Wishart distribution). This doesn't matter, as we usually want to maximize the likelihood or compute a likelihood ratio.
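
Finally, this proportionality can also be checked numerically. The sketch below (assuming NumPy and SciPy, and reading [math]\displaystyle{ W_P(\bar{S}_{N-1}; \Sigma, N-1) }[/math] as the density of [math]\displaystyle{ (N-1)\bar{S}_{N-1} }[/math] under a Wishart with scale matrix [math]\displaystyle{ \Sigma }[/math] and [math]\displaystyle{ N-1 }[/math] degrees of freedom) compares log-likelihood differences at two parameter values, since constants of proportionality cancel in such differences:

<pre>
import numpy as np
from scipy.stats import multivariate_normal, wishart

rng = np.random.default_rng(6)
N, P = 20, 3
mu_true = rng.normal(size=P)
A = rng.normal(size=(P, P))
Sigma_true = A @ A.T + P * np.eye(P)
X = rng.multivariate_normal(mu_true, Sigma_true, size=N)

xbar = X.mean(axis=0)
S_Nm1 = (X - xbar).T @ (X - xbar) / (N - 1)

def loglik(mu, Sigma):
    # full log-likelihood of the data
    return multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()

def log_normal_times_wishart(mu, Sigma):
    # log of N_P(xbar; mu, Sigma/N) * W_P((N-1) S_Nm1; Sigma, N-1)
    ln_norm = multivariate_normal.logpdf(xbar, mean=mu, cov=Sigma / N)
    ln_wish = wishart.logpdf((N - 1) * S_Nm1, df=N - 1, scale=Sigma)
    return ln_norm + ln_wish

# two arbitrary parameter settings: the constant of proportionality cancels
theta1 = (np.zeros(P), np.eye(P))
theta2 = (xbar, S_Nm1)
diff_loglik = loglik(*theta1) - loglik(*theta2)
diff_product = log_normal_times_wishart(*theta1) - log_normal_times_wishart(*theta2)
print(np.allclose(diff_loglik, diff_product))  # expected: True
</pre>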

  • References:
    • Magnus and Neudecker, Matrix differential calculus with applications in statistics and econometrics (2007)
    • Wand, Vector differential calculus in statistics (The American Statistician, 2002)