Learn about the multivariate Normal and matrix calculus
(Caution, this is my own quickanddirty tutorial, see the references at the end for presentations by professional statisticians.)
 Motivation: when we measure items, we often have to measure several properties for each item. For instance, we extract cells from each individual in our study, and we measure the expression level of all genes. We hence have, for each individual, a vector of measurements (one per gene), which leads us to the world of multivariate statistics.
 Data: we have N observations, noted X_{1},X_{2},...,X_{N}, each being of dimension P. This means that each X_{n} is a vector belonging to . Traditionally, we record all the data into an matrix named X, with samples (observations) in rows and variables (dimensions) in columns. Also, it is usual to assume vectors are in column by default. Therefore, we can write .
 Descriptive statistics: the sample mean is the Pdimensional vector with . The sample covariance is the matrix noted S^{2} with .
 Model: we suppose that the X_{n} are independent and identically distributed according to a multivariate Normal distribution N_{P}(μ,Σ). μ is the Pdimensional mean vector, and Σ the covariance matrix. If Σ is positive definite (which we will assume), the density function for a given observation is: , with  M  denoting the determinant of a matrix M and M^{T} its transpose.
 Likelihood: as usual, we will start by writing down the likelihood of the data, the parameters being θ = (μ,Σ):
L(θ) = f(X  θ)
As the observations are independent:
It is easier to work with the loglikelihood:
 ML estimation: as usual, to find the maximumlikelihood estimates of the parameters, we need to derive the loglikelihood with respect to each parameter, and then take the values of the parameters at which the loglikelihood is zero. However, in the case of multivariate distributions, this requires knowing a bit of matrix calculus, which is not always straightforward...
 Matrix calculus: some useful formulas
d(f(u)) = f'(u)du, eg. useful here: d(ln(  Σ  )) =  Σ  ^{ − 1}d(  Σ  )
d(  U  ) =  U  tr(U^{ − 1}dU)
d(U^{ − 1}) = − U^{ − 1}(dU)U^{ − 1}
 Technical details: from Magnus and Neudecker (third edition, Part Six, Chapter 15, Section 3, p.353). First, they rewrite the loglikelihood, noting that (x_{i} − μ)^{T}Σ^{ − 1}(x_{i} − μ) is a scalar, ie. a 1x1 matrix, and is therefore equal to its trace:
As the trace is invariant under cyclic permutations (tr(ABC) = tr(BCA) = tr(CAB)):
The trace is also a linear map (tr(A + B) = tr(A) + tr(B)):
And finally:
As a result:
with
We can now write the first differential of the loglikelihood:
The firstorder conditions (ie. when dl(θ) = 0) are:
and
From which follow:
and:
Note that is a biased estimate of Σ. It is usually better to use the unbiased estimate .
From Z defined above and defined similarly as but with N − 1 in the denominator, we can write the following:
Thus, by employing the same trick with the trace as above, we now have:
The likelihood depends on the samples only through the pair . Thanks to the Factorization theorem, we can say that this pair of values is a sufficient statistic for (μ,Σ).
We can also transform a bit more the formula of the likelihood in order to find the distribution of this sufficient statistic:
The likelihood is only proportional because the first constant is not used in any of the two distributions and a few constants are missing (eg. the Gamma function appearing in the density of the Wishart distribution). This doesn't matter as we usually want to maximize the likelihood or compute a likelihood ratio.
 References:
 Magnus and Neudecker, Matrix differential calculus with applications in statistics and econometrics (2007)
 Wand, Vector differential calculus in statistics (The American Statistician, 2002)
