# User:Timothee Flutre/Notebook/Postdoc/2012/08/16


## Variational Bayes approach for the mixture of Normals

• Motivation: I have described on another page the basics of mixture models and the EM algorithm in a frequentist context. It is worth reading before continuing. Here I am interested in the Bayesian approach as well as in a specific variational method (nicknamed "Variational Bayes").

• Data: we have N univariate observations, $y_1, \ldots, y_N$, gathered into the vector $\mathbf{y}$.

• Assumptions: we assume the observations to be exchangeable and distributed according to a mixture of K Normal distributions. The parameters of this model are the mixture weights ($w_k$), the means ($\mu_k$) and the precisions ($\tau_k$) of the mixture components, all gathered into $\Theta = \{w_1,\ldots,w_K,\mu_1,\ldots,\mu_K,\tau_1,\ldots,\tau_K\}$. There are two constraints: $\sum_{k=1}^K w_k = 1$ and $\forall k \; w_k > 0$.

• Observed likelihood: $p(\mathbf{y} | \Theta, K) = \prod_{n=1}^N p(y_n|\Theta,K) = \prod_{n=1}^N \sum_{k=1}^K w_k Normal(y_n;\mu_k,\tau_k^{-1})$
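
As an illustration of the formula above, here is a minimal sketch in plain Python of the observed log-likelihood. The function names `normal_logpdf` and `observed_loglik` are my own, not from any library, and the inner sum over components is computed with the log-sum-exp trick for numerical stability, which is an implementation choice rather than part of the model.

```python
import math

def normal_logpdf(y, mu, tau):
    """Log-density of Normal(y; mu, 1/tau), where tau is the precision."""
    return 0.5 * (math.log(tau) - math.log(2.0 * math.pi)) - 0.5 * tau * (y - mu) ** 2

def observed_loglik(y, w, mu, tau):
    """ln p(y | Theta, K) = sum_n ln sum_k w_k Normal(y_n; mu_k, 1/tau_k)."""
    total = 0.0
    for y_n in y:
        # log of each term w_k * Normal(y_n; mu_k, 1/tau_k)
        terms = [math.log(w_k) + normal_logpdf(y_n, mu_k, tau_k)
                 for w_k, mu_k, tau_k in zip(w, mu, tau)]
        # log-sum-exp over the K components
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total
```

With a single component (K = 1, weight 1) this reduces to the usual Normal log-likelihood, which is a convenient sanity check.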

• Maximizing the observed log-likelihood: as shown here, maximizing the likelihood of a mixture model is like doing a weighted likelihood maximization. However, these weights depend on the parameters we want to estimate! That's why we now switch to the missing-data formulation of the mixture model.

• Latent variables: let's introduce N latent variables, $z_1,\ldots,z_N$, gathered into the vector $\mathbf{z}$. Each $z_n$ is a vector of length K with a single 1 indicating the component to which the $n$-th observation belongs, and K-1 zeros.

• Augmented likelihood: $p(\mathbf{y},\mathbf{z}|\Theta,K) = \prod_{n=1}^N p(y_n,z_n|\Theta,K) = \prod_{n=1}^N p(z_n|\Theta,K) p(y_n|z_n,\Theta,K) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}} Normal(y_n;\mu_k,\tau_k^{-1})^{z_{nk}}$
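
The augmented likelihood can be sketched in the same way. In this hedged example, `augmented_loglik` is a hypothetical helper name and each $z_n$ is assumed to be stored as a one-hot list of length K. Because of the exponents $z_{nk}$, only the component to which each observation is assigned contributes, so the sum over components inside the logarithm disappears.

```python
import math

def normal_logpdf(y, mu, tau):
    """Log-density of Normal(y; mu, 1/tau), where tau is the precision."""
    return 0.5 * (math.log(tau) - math.log(2.0 * math.pi)) - 0.5 * tau * (y - mu) ** 2

def augmented_loglik(y, z, w, mu, tau):
    """ln p(y, z | Theta, K), with z[n] a one-hot list of length K."""
    total = 0.0
    for y_n, z_n in zip(y, z):
        for k, z_nk in enumerate(z_n):
            if z_nk:  # only the component with z_nk = 1 contributes
                total += math.log(w[k]) + normal_logpdf(y_n, mu[k], tau[k])
    return total

# Two observations, each assigned to a different component
y = [0.0, 3.0]
z = [[1, 0], [0, 1]]
value = augmented_loglik(y, z, w=[0.4, 0.6], mu=[0.0, 3.0], tau=[1.0, 1.0])
```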

• Priors: in the Bayesian paradigm, parameters and latent variables are random variables for which we want to infer the posterior distribution. To make the calculations tractable, we choose for them prior distributions that are conjugate with the form of the likelihood.
• for the parameters: $\forall k \; \mu_k | \tau_k \sim Normal(\mu_0,(\tau_0 \tau_k)^{-1})$ and $\forall k \; \tau_k \sim Gamma(\alpha,\beta)$
• for the latent variables: $\forall n \; z_n \sim Multinomial_K(1,\mathbf{w})$ and $\mathbf{w} \sim Dirichlet(\gamma)$
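
To make the hierarchy of priors concrete, here is a minimal sketch that draws one data set from the full generative model using only Python's `random` module. It rests on assumptions that the text does not state: the Gamma prior is taken to be parametrized by shape $\alpha$ and rate $\beta$, the Dirichlet is taken to be symmetric with a scalar concentration $\gamma$, and each $z_n$ is stored as a component index instead of a one-hot vector. The function name `sample_from_prior` is hypothetical.

```python
import random

def sample_from_prior(N, K, mu0, tau0, alpha, beta, gamma, rng):
    """Draw (w, mu, tau, z, y) from the priors and the mixture likelihood."""
    # tau_k ~ Gamma(shape=alpha, rate=beta); gammavariate expects a scale, hence 1/beta
    tau = [rng.gammavariate(alpha, 1.0 / beta) for _ in range(K)]
    # mu_k | tau_k ~ Normal(mu0, 1/(tau0 * tau_k)); gauss expects a standard deviation
    mu = [rng.gauss(mu0, (tau0 * t) ** -0.5) for t in tau]
    # w ~ symmetric Dirichlet(gamma): normalize independent Gamma(gamma, 1) draws
    g = [rng.gammavariate(gamma, 1.0) for _ in range(K)]
    s = sum(g)
    w = [x / s for x in g]
    # z_n ~ Multinomial_K(1, w), stored as the index of the chosen component
    z = rng.choices(range(K), weights=w, k=N)
    # y_n | z_n = k ~ Normal(mu_k, 1/tau_k)
    y = [rng.gauss(mu[k], tau[k] ** -0.5) for k in z]
    return w, mu, tau, z, y

rng = random.Random(1859)  # arbitrary seed, for reproducibility only
w, mu, tau, z, y = sample_from_prior(N=100, K=3, mu0=0.0, tau0=0.1,
                                     alpha=2.0, beta=1.0, gamma=1.0, rng=rng)
```

The hyperparameter values in the call are arbitrary and only serve to show the order in which the quantities are drawn.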

• Variational Bayes: our primary goal here is to calculate the marginal log-likelihood of our data set:

$\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; p(\mathbf{y}, \mathbf{z}, \Theta | K)$

However, the presence of latent variables induces dependencies between all the parameters of the model, which makes the integral above intractable and the marginal likelihood difficult to work with directly. An elegant solution is to introduce a "variational distribution" q of the parameters and the latent variables:

$\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \left( \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} \right) + C_{\mathbf{z}, \Theta}$

The constant C is here to remind us that q has the constraint of being a distribution, i.e. of summing to 1, which can be enforced by a Lagrange multiplier.

The crucial assumption is that the parameters and the latent variables are independent under q:

$q(\mathbf{z}, \Theta) = q(\mathbf{z}) q(\Theta)$

We can then use the concavity of the logarithm and Jensen's inequality to optimize a lower bound of the marginal log-likelihood:

$\mathrm{ln} \, p(\mathbf{y} | K) \ge \int_\Theta \, \mathrm{d}\Theta \; q(\Theta) \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; q(\mathbf{z}) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z} | \Theta, K)}{q(\mathbf{z})} + \mathrm{ln} \, \frac{p(\Theta | K)}{q(\Theta)} \right) + C_{\mathbf{z}} + C_{\Theta}$

Now we have to optimize the right-hand side of the inequality. Let's name it $\mathcal{F}$ as it is a functional, i.e. a function of functions. Using the calculus of variations, we'll find the function q that maximizes it.
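
Jensen's bound can be checked numerically on a deliberately simplified case: a single observation $y_n$ with $\Theta$ held fixed, so that q is only a distribution over the K possible values of $z_n$. This is not the full functional $\mathcal{F}$, only a sketch of the same inequality; the helper names below are hypothetical. For any q the bound stays below $\mathrm{ln} \, p(y_n | \Theta, K)$, and it becomes tight when q equals the posterior of $z_n$.

```python
import math

def normal_logpdf(y, mu, tau):
    """Log-density of Normal(y; mu, 1/tau), where tau is the precision."""
    return 0.5 * (math.log(tau) - math.log(2.0 * math.pi)) - 0.5 * tau * (y - mu) ** 2

def log_joint(y_n, w, mu, tau):
    """ln p(y_n, z_n = k | Theta) for each component k."""
    return [math.log(w_k) + normal_logpdf(y_n, mu_k, tau_k)
            for w_k, mu_k, tau_k in zip(w, mu, tau)]

def log_evidence(lj):
    """ln p(y_n | Theta) = ln sum_k p(y_n, z_n = k | Theta), via log-sum-exp."""
    m = max(lj)
    return m + math.log(sum(math.exp(l - m) for l in lj))

def lower_bound(q, lj):
    """sum_k q_k ln(p(y_n, z_n = k | Theta) / q_k); terms with q_k = 0 are skipped."""
    return sum(q_k * (l - math.log(q_k)) for q_k, l in zip(q, lj) if q_k > 0)

def posterior(lj):
    """p(z_n = k | y_n, Theta), the q that makes the bound tight."""
    le = log_evidence(lj)
    return [math.exp(l - le) for l in lj]

lj = log_joint(1.0, w=[0.3, 0.7], mu=[0.0, 2.0], tau=[1.0, 4.0])
le = log_evidence(lj)
gap_uniform = le - lower_bound([0.5, 0.5], lj)   # non-negative for any valid q
gap_exact = le - lower_bound(posterior(lj), lj)  # zero up to rounding error
```

The gap between the two sides is the Kullback-Leibler divergence from q to the posterior, which is why maximizing the bound over q pulls q towards the posterior.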