User:Timothee Flutre/Notebook/Postdoc/2012/01/02: Difference between revisions
m (→Learn about the multivariate Normal and matrix calculus: mention Sbar_n) |
Line 14: | Line 14: | ||
* '''Data''': we have N observations, noted <math>X = (x_1, x_2, ..., x_N)</math>, each being of dimension <math>P</math>. This means that each <math>x_i</math> is a vector belonging to <math>\mathbb{R}^P</math>. | * '''Data''': we have N observations, noted <math>X = (x_1, x_2, ..., x_N)</math>, each being of dimension <math>P</math>. This means that each <math>x_i</math> is a vector belonging to <math>\mathbb{R}^P</math>. | ||
* '''Model''': we suppose that the <math>x_i</math> are independent and identically distributed according to a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate Normal distribution] <math>N_P(\mu, \Sigma)</math>. <math>\mu</math> is the P-dimensional mean vector, and <math>\Sigma</math> the PxP covariance matrix. If <math>\Sigma</math> is [http://en.wikipedia.org/wiki/Positive-definite_matrix positive definite] (which we will assume), the density function for a given x is: <math>f(x | * '''Model''': we suppose that the <math>x_i</math> are independent and identically distributed according to a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate Normal distribution] <math>N_P(\mu, \Sigma)</math>. <math>\mu</math> is the P-dimensional mean vector, and <math>\Sigma</math> the PxP covariance matrix. If <math>\Sigma</math> is [http://en.wikipedia.org/wiki/Positive-definite_matrix positive definite] (which we will assume), the density function for a given x is: <math>f(x|\mu,\Sigma) = (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu))</math>, with <math>|M|</math> denoting the determinant of a matrix and <math>M^T</math> its transpose. | ||
* '''Likelihood''': as usual, we will start by writing down the likelihood of the data, the parameters being <math>\theta=(\mu,\Sigma)</math>: | * '''Likelihood''': as usual, we will start by writing down the likelihood of the data, the parameters being <math>\theta=(\mu,\Sigma)</math>: | ||
<math>L(\theta) = | <math>L(\theta) = f(X|\theta)</math> | ||
As the observations are independent: | As the observations are independent: | ||
<math>L(\theta) = \prod_{i=1}^N f(x_i | <math>L(\theta) = \prod_{i=1}^N f(x_i | \theta)</math> | ||
It is easier to work with the log-likelihood: | It is easier to work with the log-likelihood: | ||
<math>l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i | <math>l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i | \theta) )</math> | ||
<math>l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu)</math> | <math>l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu)</math> | ||
Line 32: | Line 32: | ||
* '''ML estimation''': as usual, to find the [http://en.wikipedia.org/wiki/Maximum_likelihood maximum-likelihood estimates] of the parameters, we need to differentiate the log-likelihood with respect to each parameter, and then take the values of the parameters at which these derivatives are zero. However, in the case of multivariate distributions, this requires knowing a bit of [http://en.wikipedia.org/wiki/Matrix_calculus matrix calculus], which is not always straightforward... | * '''ML estimation''': as usual, to find the [http://en.wikipedia.org/wiki/Maximum_likelihood maximum-likelihood estimates] of the parameters, we need to differentiate the log-likelihood with respect to each parameter, and then take the values of the parameters at which these derivatives are zero. However, in the case of multivariate distributions, this requires knowing a bit of [http://en.wikipedia.org/wiki/Matrix_calculus matrix calculus], which is not always straightforward... | ||
* '''Matrix calculus''': | * '''Matrix calculus''': some useful formulas | ||
<math>d(f(u)) = f'(u) du</math>, eg. useful here: <math>d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|)</math> | <math>d(f(u)) = f'(u) du</math>, eg. useful here: <math>d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|)</math> | ||
Line 76: | Line 76: | ||
<math>d l(\theta) = \frac{1}{2} tr((d\Sigma)\Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + N (d\mu)^T \Sigma^{-1} (\bar{x} - \mu)</math> | <math>d l(\theta) = \frac{1}{2} tr((d\Sigma)\Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + N (d\mu)^T \Sigma^{-1} (\bar{x} - \mu)</math> | ||
The first-order conditions are: | The first-order conditions (ie. when <math>d l(\theta) = 0</math>) are: | ||
<math>\hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0</math> and <math>\hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0</math> | <math>\hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0</math> and <math>\hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0</math> | ||
Line 86: | Line 86: | ||
and: | and: | ||
<math>\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T</math> | <math>\hat{\Sigma} = \bar{S}_N = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T</math> | ||
* '''References''': | * '''References''': |
Revision as of 13:08, 5 April 2012
Learn about the multivariate Normal and matrix calculus
(Caution, this is my own quick-and-dirty tutorial; see the references at the end for presentations by professional statisticians.)
[math]\displaystyle{ L(\theta) = f(X|\theta) }[/math]
As the observations are independent:
[math]\displaystyle{ L(\theta) = \prod_{i=1}^N f(x_i | \theta) }[/math]
It is easier to work with the log-likelihood:
[math]\displaystyle{ l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i | \theta) ) }[/math]
[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math]
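To make the last formula concrete, here is a minimal sketch in plain Python (standard library only) that evaluates the log-likelihood for given data and parameters. The function names (`cholesky`, `forward_solve`, `mvn_loglik`) are my own choices for illustration, and using a Cholesky factor to get both the log-determinant and the quadratic forms is one convenient option among several, not something prescribed by the derivation.

```python
import math

def cholesky(S):
    """Return lower-triangular L with L L^T = S (S symmetric positive definite)."""
    P = len(S)
    L = [[0.0] * P for _ in range(P)]
    for i in range(P):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L y = b for lower-triangular L by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def mvn_loglik(X, mu, Sigma):
    """Log-likelihood l(theta) of N i.i.d. observations under N_P(mu, Sigma)."""
    N, P = len(X), len(mu)
    L = cholesky(Sigma)
    # ln|Sigma| = 2 * sum_i ln(L_ii), because |Sigma| = |L|^2
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(P))
    quad = 0.0
    for x in X:
        # (x - mu)^T Sigma^{-1} (x - mu) = ||y||^2 where L y = x - mu
        y = forward_solve(L, [xi - mi for xi, mi in zip(x, mu)])
        quad += sum(v * v for v in y)
    return -0.5 * N * P * math.log(2 * math.pi) - 0.5 * N * logdet - 0.5 * quad

# One observation at the mean of a standard bivariate Normal: l = -ln(2*pi)
print(mvn_loglik([[0.0, 0.0]], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

The three terms of the return value map one-to-one onto the three terms of the formula, which makes it easy to check each of them separately.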
[math]\displaystyle{ d(f(u)) = f'(u) du }[/math], eg. useful here: [math]\displaystyle{ d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|) }[/math]
[math]\displaystyle{ d(|U|) = |U| tr(U^{-1} dU) }[/math]
[math]\displaystyle{ d(U^{-1}) = - U^{-1} (dU) U^{-1} }[/math]
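These differential formulas can be checked numerically. The sketch below, in plain Python and restricted to 2x2 matrices for brevity, compares each formula with a finite difference along an arbitrary direction `dU`. The matrices `U` and `dU`, the step `eps` and all helper names are assumptions picked only for this illustration.

```python
def det2(U):
    """Determinant of a 2x2 matrix."""
    return U[0][0] * U[1][1] - U[0][1] * U[1][0]

def inv2(U):
    """Inverse of a 2x2 matrix."""
    d = det2(U)
    return [[U[1][1] / d, -U[0][1] / d], [-U[1][0] / d, U[0][0] / d]]

def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(A):
    return A[0][0] + A[1][1]

U = [[2.0, 0.5], [0.5, 1.0]]      # arbitrary invertible matrix
dU = [[0.3, -0.2], [0.1, 0.4]]    # arbitrary perturbation direction
eps = 1e-6                        # finite-difference step

U_eps = [[U[i][j] + eps * dU[i][j] for j in range(2)] for i in range(2)]

# d(|U|) = |U| tr(U^{-1} dU)
fd_det = (det2(U_eps) - det2(U)) / eps
formula_det = det2(U) * trace(matmul(inv2(U), dU))

# d(U^{-1}) = - U^{-1} (dU) U^{-1}
inv_U, inv_U_eps = inv2(U), inv2(U_eps)
fd_inv = [[(inv_U_eps[i][j] - inv_U[i][j]) / eps for j in range(2)]
          for i in range(2)]
prod = matmul(matmul(inv_U, dU), inv_U)
formula_inv = [[-prod[i][j] for j in range(2)] for i in range(2)]

print(fd_det, formula_det)   # the two numbers should agree up to O(eps)
print(fd_inv)
print(formula_inv)           # entrywise agreement up to O(eps)
```

A finite difference only agrees with the differential up to terms of order `eps`, so small discrepancies in the last digits are expected.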
Each quadratic form is a scalar, and a scalar equals its own trace:
[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) }[/math]
As the trace is invariant under cyclic permutations ([math]\displaystyle{ tr(ABC) = tr(BCA) = tr(CAB) }[/math]):
[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]
The trace is also a linear map ([math]\displaystyle{ tr(A+B) = tr(A) + tr(B) }[/math]):
[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]
And finally:
[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T ) }[/math]
As a result:
[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} tr(\Sigma^{-1} Z) }[/math] with [math]\displaystyle{ Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T }[/math]
We can now write the first differential of the log-likelihood:
[math]\displaystyle{ d l(\theta) = - \frac{N}{2} d(ln(|\Sigma|)) - \frac{1}{2} d(tr(\Sigma^{-1} Z)) }[/math]
[math]\displaystyle{ d l(\theta) = - \frac{N}{2} |\Sigma|^{-1} d(|\Sigma|) - \frac{1}{2} tr(d(\Sigma^{-1}Z)) }[/math]
[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) - \frac{1}{2} tr(d(\Sigma^{-1})Z) - \frac{1}{2} tr(\Sigma^{-1} dZ) }[/math]
[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr(\Sigma^{-1} (\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T)) }[/math]
The step from the line above to the line below uses three facts. First, by cyclic invariance, [math]\displaystyle{ tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) = tr((d\Sigma) \Sigma^{-1} Z \Sigma^{-1}) }[/math] and [math]\displaystyle{ tr(\Sigma^{-1} d\Sigma) = tr((d\Sigma) \Sigma^{-1} \Sigma \Sigma^{-1}) }[/math], so the two [math]\displaystyle{ d\Sigma }[/math] terms can be merged. Second, again by cyclic invariance, [math]\displaystyle{ tr(\Sigma^{-1} (x_i - \mu) (d\mu)^T) = (d\mu)^T \Sigma^{-1} (x_i - \mu) }[/math], which is a scalar. Third, as [math]\displaystyle{ \Sigma^{-1} }[/math] is symmetric, [math]\displaystyle{ tr(\Sigma^{-1} (d\mu) (x_i - \mu)^T) = (x_i - \mu)^T \Sigma^{-1} d\mu = (d\mu)^T \Sigma^{-1} (x_i - \mu) }[/math], so the two [math]\displaystyle{ d\mu }[/math] sums are equal and their total cancels the factor [math]\displaystyle{ \frac{1}{2} }[/math]:
[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr((d\Sigma)\Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math]
[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr((d\Sigma)\Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + N (d\mu)^T \Sigma^{-1} (\bar{x} - \mu) }[/math]
The first-order conditions (ie. when [math]\displaystyle{ d l(\theta) = 0 }[/math] for all [math]\displaystyle{ d\Sigma }[/math] and [math]\displaystyle{ d\mu }[/math]) are:
[math]\displaystyle{ \hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0 }[/math] and [math]\displaystyle{ \hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0 }[/math]
From which follow (with [math]\displaystyle{ Z }[/math] evaluated at [math]\displaystyle{ \mu = \hat{\mu} }[/math]):
[math]\displaystyle{ \hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math]
and:
[math]\displaystyle{ \hat{\Sigma} = \bar{S}_N = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T }[/math]
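The two closed-form estimates are easy to compute directly. Below is a minimal sketch in plain Python; the function name `mvn_mle` and the toy data set are made up for this illustration.

```python
def mvn_mle(X):
    """Maximum-likelihood estimates (mu_hat, Sigma_hat) for i.i.d. N_P(mu, Sigma) data.

    X is a list of N observations, each a list of P floats.
    """
    N, P = len(X), len(X[0])
    # mu_hat is the sample mean vector
    mu = [sum(x[j] for x in X) / N for j in range(P)]
    # Sigma_hat is the scatter matrix around the mean divided by N (not N - 1)
    Sigma = [[sum((x[j] - mu[j]) * (x[k] - mu[k]) for x in X) / N
              for k in range(P)]
             for j in range(P)]
    return mu, Sigma

# Toy data: N = 4 observations of dimension P = 2
X = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [6.0, 2.0]]
mu_hat, Sigma_hat = mvn_mle(X)
print(mu_hat)     # [3.0, 2.0]
print(Sigma_hat)  # [[3.5, -0.5], [-0.5, 2.0]]
```

Note that the ML estimate divides by N, whereas many statistical routines return the unbiased estimate that divides by N-1 by default (for instance NumPy's `np.cov`, unless `bias=True` is passed), so the two can differ noticeably for small samples.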