# User:Timothee Flutre/Notebook/Postdoc/2012/08/16


## Variational Bayes approach for the mixture of Normals

• Motivation: I have described on another page the basics of mixture models and the EM algorithm in a frequentist context. It is worth reading before continuing. Here I am interested in the Bayesian approach as well as in a specific variational method (nicknamed "Variational Bayes").

• Data: N univariate observations, $y_1, \ldots, y_N$, gathered into the vector $\mathbf{y}$
• Model: mixture of K Normal distributions
• Parameters: K mixture weights ($w_1, \ldots, w_K$), K means ($\mu_1, \ldots, \mu_K$) and K precisions ($\tau_1, \ldots, \tau_K$), one per mixture component; we write $\theta = (\mathbf{w}, \boldsymbol{\mu}, \boldsymbol{\tau})$

• Constraints: $\forall k, \; w_k \ge 0$ and $\sum_{k=1}^K w_k = 1$.
• Observed likelihood: observations assumed exchangeable (independent and identically distributed given the parameters)

$p(\mathbf{y} | \theta) = \prod_{i=1}^N p(y_i | \theta) = \prod_{i=1}^N \sum_{k=1}^K w_k \, \mathcal{N}(y_i; \mu_k, \tau_k^{-1})$

• Latent variables: N hidden variables, $\mathbf{z}_1, \ldots, \mathbf{z}_N$, each being a vector of length K with a single 1 indicating the component to which the observation belongs ($z_{ik} = 1$), and K-1 zeroes.

• Augmented likelihood:

$p(\mathbf{y}, \mathbf{z} | \theta) = \prod_{i=1}^N p(\mathbf{z}_i | \mathbf{w}) \, p(y_i | \mathbf{z}_i, \boldsymbol{\mu}, \boldsymbol{\tau}) = \prod_{i=1}^N \prod_{k=1}^K \left( w_k \, \mathcal{N}(y_i; \mu_k, \tau_k^{-1}) \right)^{z_{ik}}$

• Maximum-likelihood estimation: integrate out the latent variables

$p(\mathbf{y} | \theta) = \sum_{\mathbf{z}} p(\mathbf{y}, \mathbf{z} | \theta)$
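
The generative model above can be checked numerically: simulate data from the mixture and evaluate the observed log-likelihood. A minimal sketch in Python, where the parameter values (N, K, the weights, means and precisions) are illustrative assumptions, not from the text:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(20120816)

# illustrative "true" parameters (assumptions, not from the text)
K = 2
w = np.array([0.3, 0.7])      # mixture weights (sum to 1)
mu = np.array([-2.0, 3.0])    # component means
tau = np.array([1.0, 0.25])   # component precisions (variance = 1/tau)
N = 1000

# simulate: draw each latent indicator z_i, then y_i | z_i
z = rng.choice(K, size=N, p=w)
y = rng.normal(loc=mu[z], scale=1.0 / np.sqrt(tau[z]))

# observed log-likelihood: sum_i ln sum_k w_k N(y_i; mu_k, 1/tau_k)
dens = w * norm.pdf(y[:, None], loc=mu, scale=1.0 / np.sqrt(tau))  # N x K
loglik = np.log(dens.sum(axis=1)).sum()
print(loglik)
```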

The latent variables induce dependencies between all the parameters of the model. This makes it difficult to find the parameters that maximize the likelihood. An elegant solution is to introduce a variational distribution of parameters and latent variables, which leads to a re-formulation of the classical EM algorithm. But let's show it directly in the Bayesian paradigm.

• Priors: conjugate
• weights: $\mathbf{w} \sim \mathrm{Dirichlet}(\alpha_0, \ldots, \alpha_0)$
• means and precisions (Normal-Gamma): $\mu_k | \tau_k \sim \mathcal{N}(\mu_0, (\kappa_0 \tau_k)^{-1})$ and $\tau_k \sim \mathrm{Gamma}(a_0, b_0)$

• Variational Bayes: let's focus here on calculating the marginal log-likelihood of our data set in order to perform model comparison:

$\ln p(\mathbf{y}) = \ln \left( \int_\theta \sum_{\mathbf{z}} p(\mathbf{y}, \mathbf{z}, \theta) \, \mathrm{d}\theta \right)$

We can now introduce a distribution $q(\mathbf{z}, \theta)$:

$\ln p(\mathbf{y}) = \ln \left( \int_\theta \sum_{\mathbf{z}} q(\mathbf{z}, \theta) \, \frac{p(\mathbf{y}, \mathbf{z}, \theta)}{q(\mathbf{z}, \theta)} \, \mathrm{d}\theta \right)$

Remember that $q$ has the constraint of being a distribution, i.e. of summing to 1, which can be enforced by a Lagrange multiplier.

We can then use the concavity of the logarithm (Jensen's inequality) to derive a lower bound of the marginal log-likelihood:

$\ln p(\mathbf{y}) \ge \int_\theta \sum_{\mathbf{z}} q(\mathbf{z}, \theta) \, \ln \frac{p(\mathbf{y}, \mathbf{z}, \theta)}{q(\mathbf{z}, \theta)} \, \mathrm{d}\theta$

Let's call this lower bound $\mathcal{F}(q)$ as it is a functional, i.e. a function of functions. To gain some intuition about the impact of introducing $q$, let's expand $\mathcal{F}(q)$:

$\mathcal{F}(q) = \int_\theta \sum_{\mathbf{z}} q(\mathbf{z}, \theta) \, \ln \frac{p(\mathbf{z}, \theta | \mathbf{y}) \, p(\mathbf{y})}{q(\mathbf{z}, \theta)} \, \mathrm{d}\theta = \ln p(\mathbf{y}) - \mathrm{KL}\left( q(\mathbf{z}, \theta) \,\|\, p(\mathbf{z}, \theta | \mathbf{y}) \right)$

From this, it is clear that $\mathcal{F}(q)$ (a lower bound of the marginal log-likelihood, since $\mathrm{KL} \ge 0$) is the marginal log-likelihood minus the Kullback-Leibler divergence between the variational distribution $q(\mathbf{z}, \theta)$ and the joint posterior $p(\mathbf{z}, \theta | \mathbf{y})$ of latent variables and parameters. As a side note, minimizing the reverse divergence, $\mathrm{KL}(p \| q)$, is what the expectation-propagation technique does.

In practice, we have to make the following crucial assumption of independence on $q$ in order for the calculations to be analytically tractable:

$q(\mathbf{z}, \theta) = q_{\mathbf{z}}(\mathbf{z}) \, q_{\theta}(\theta)$

This means that $q$ approximates the joint posterior $p(\mathbf{z}, \theta | \mathbf{y})$, and therefore the lower bound will be tight if and only if this approximation is exact, i.e. the KL divergence is zero.

As we ultimately aim at inferring the parameters and latent variables that maximize the marginal log-likelihood, we will use the calculus of variations to find the functions $q_{\mathbf{z}}$ and $q_{\theta}$ that maximize the functional $\mathcal{F}$.

This naturally leads to a procedure very similar to the EM algorithm: at the E step, we update the variational distribution $q_{\mathbf{z}}$ over the latent variables using expectations taken under $q_{\theta}$, and, at the M step, we update the variational distribution $q_{\theta}$ over the parameters using expectations taken under $q_{\mathbf{z}}$.

We start by writing the functional derivative of $\mathcal{F}$ with respect to $q_{\mathbf{z}}$:

$\frac{\partial \mathcal{F}}{\partial q_{\mathbf{z}}} = \int_\theta q_{\theta}(\theta) \, \ln p(\mathbf{y}, \mathbf{z}, \theta) \, \mathrm{d}\theta - \ln q_{\mathbf{z}}(\mathbf{z}) + \text{cst}$

Then we set this functional derivative to zero, which gives $\ln q^*_{\mathbf{z}}(\mathbf{z}) = \mathbb{E}_{q_{\theta}}\left[ \ln p(\mathbf{y}, \mathbf{z}, \theta) \right] + \text{cst}$. We also make use of a frequent assumption, namely that the variational distribution fully factorizes over each individual latent variable (mean-field assumption):

$q_{\mathbf{z}}(\mathbf{z}) = \prod_{i=1}^N q_{\mathbf{z}_i}(\mathbf{z}_i)$

Recognizing the expectation and factorizing $p(\mathbf{y}, \mathbf{z}, \theta)$ into $p(\mathbf{z} | \mathbf{w}) \, p(\mathbf{y} | \mathbf{z}, \boldsymbol{\mu}, \boldsymbol{\tau}) \, p(\theta)$, we get:

$\ln q^*_{\mathbf{z}}(\mathbf{z}) = \sum_{i=1}^N \sum_{k=1}^K z_{ik} \, \ln \rho_{ik} + \text{cst}$

where

$\ln \rho_{ik} = \mathbb{E}[\ln w_k] + \frac{1}{2} \mathbb{E}[\ln \tau_k] - \frac{1}{2} \ln(2\pi) - \frac{1}{2} \mathbb{E}_{\mu_k, \tau_k}\left[ \tau_k (y_i - \mu_k)^2 \right]$

Taking the exponential:

$q^*_{\mathbf{z}}(\mathbf{z}) \propto \prod_{i=1}^N \prod_{k=1}^K \rho_{ik}^{z_{ik}}$

As this should be a distribution, it should sum to one, and therefore:

$q^*_{\mathbf{z}}(\mathbf{z}) = \prod_{i=1}^N \prod_{k=1}^K r_{ik}^{z_{ik}}$

where $r_{ik} = \frac{\rho_{ik}}{\sum_{j=1}^K \rho_{ij}}$ ("r" stands for "responsibility")
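
In code, the normalization from the $\rho_{ik}$ to the responsibilities $r_{ik}$ is best done on the log scale with the log-sum-exp trick, since the $\ln \rho_{ik}$ can be very negative. A sketch, where `ln_rho` stands for the N x K matrix of $\ln \rho_{ik}$ (the values below are arbitrary placeholders):

```python
import numpy as np

def responsibilities(ln_rho):
    """Normalize ln(rho_ik) row-wise into responsibilities r_ik.

    The log-sum-exp trick (subtracting the row maximum) avoids
    underflow when the unnormalized log-probabilities are very negative.
    """
    m = ln_rho.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(ln_rho - m).sum(axis=1, keepdims=True))
    return np.exp(ln_rho - log_norm)

# toy example: 3 observations, 2 components
ln_rho = np.array([[-1.0, -2.0],
                   [-500.0, -501.0],   # would underflow without the trick
                   [0.0, 0.0]])
r = responsibilities(ln_rho)
print(r)  # each row sums to 1
```

Note that shifting a row of `ln_rho` by a constant leaves its responsibilities unchanged, which is why the unknown normalizing constant in $\ln q^*_{\mathbf{z}}$ is harmless.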

Interestingly, even though we haven't specified anything yet about $q_{\mathbf{z}_i}$, we can see that it is of the same form as the prior on $\mathbf{z}_i$, a Multinomial distribution.

We start by writing the functional derivative of $\mathcal{F}$ with respect to $q_{\theta}$:

$\frac{\partial \mathcal{F}}{\partial q_{\theta}} = \sum_{\mathbf{z}} q_{\mathbf{z}}(\mathbf{z}) \, \ln p(\mathbf{y}, \mathbf{z}, \theta) - \ln q_{\theta}(\theta) + \text{cst}$

Then, when setting this functional derivative to zero, we obtain:

$\ln q^*_{\theta}(\theta) = \ln p(\mathbf{w}) + \sum_{i=1}^N \mathbb{E}[\mathbf{z}_i]^T \ln \mathbf{w} + \sum_{k=1}^K \ln p(\mu_k, \tau_k) + \sum_{i=1}^N \sum_{k=1}^K \mathbb{E}[z_{ik}] \, \ln \mathcal{N}(y_i; \mu_k, \tau_k^{-1}) + \text{cst}$

Note how no term involves both $\mathbf{w}$ and $(\boldsymbol{\mu}, \boldsymbol{\tau})$. This naturally implies the factorization $q_{\theta}(\theta) = q_{\mathbf{w}}(\mathbf{w}) \, q_{\boldsymbol{\mu}, \boldsymbol{\tau}}(\boldsymbol{\mu}, \boldsymbol{\tau})$. And we can also notice the further factorization $q_{\boldsymbol{\mu}, \boldsymbol{\tau}}(\boldsymbol{\mu}, \boldsymbol{\tau}) = \prod_{k=1}^K q_{\mu_k, \tau_k}(\mu_k, \tau_k)$.

Starting with $q_{\mathbf{w}}$:

$\ln q^*_{\mathbf{w}}(\mathbf{w}) = \ln p(\mathbf{w}) + \sum_{i=1}^N \sum_{k=1}^K r_{ik} \, \ln w_k + \text{cst} = \sum_{k=1}^K \left( \alpha_0 - 1 + \sum_{i=1}^N r_{ik} \right) \ln w_k + \text{cst}$

Recognizing the kernel of a Dirichlet and taking the exponential, we get another Dirichlet distribution:

$q^*_{\mathbf{w}}(\mathbf{w}) = \mathrm{Dirichlet}(\mathbf{w}; \alpha_1, \ldots, \alpha_K)$

that is, with $\alpha_k = \alpha_0 + \sum_{i=1}^N r_{ik}$.

As for the other parameters, $\mu_k$ and $\tau_k$, we recognize a Normal-Gamma distribution.

It's easier to first define three statistics of the data with respect to the responsibilities:

$N_k = \sum_{i=1}^N r_{ik}, \quad \bar{y}_k = \frac{1}{N_k} \sum_{i=1}^N r_{ik} \, y_i, \quad S_k = \frac{1}{N_k} \sum_{i=1}^N r_{ik} \, (y_i - \bar{y}_k)^2.$
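
These three statistics are responsibility-weighted versions of the usual sample size, mean and variance. A sketch assuming `r` is the N x K responsibility matrix from the E step:

```python
import numpy as np

def suff_stats(y, r):
    """Responsibility-weighted statistics per component k:
    N_k (effective sample size), ybar_k (weighted mean), S_k (weighted variance)."""
    N_k = r.sum(axis=0)                                    # shape (K,)
    ybar_k = (r * y[:, None]).sum(axis=0) / N_k            # shape (K,)
    S_k = (r * (y[:, None] - ybar_k) ** 2).sum(axis=0) / N_k
    return N_k, ybar_k, S_k

# toy check: with hard (0/1) assignments these reduce to per-group statistics
y = np.array([0.0, 1.0, 10.0, 12.0])
r = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
N_k, ybar_k, S_k = suff_stats(y, r)
print(N_k, ybar_k, S_k)  # N_k=[2,2], ybar_k=[0.5,11], S_k=[0.25,1]
```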

This allows us to concisely write the variational Normal-Gamma posterior (TODO: check the algebra...):

$q^*_{\mu_k, \tau_k}(\mu_k, \tau_k) = \mathcal{N}\left( \mu_k; m_k, (\kappa_k \tau_k)^{-1} \right) \, \mathrm{Gamma}(\tau_k; a_k, b_k)$

with $\kappa_k = \kappa_0 + N_k$, $m_k = \frac{\kappa_0 \mu_0 + N_k \bar{y}_k}{\kappa_k}$, $a_k = a_0 + \frac{N_k}{2}$ and $b_k = b_0 + \frac{1}{2} \left( N_k S_k + \frac{\kappa_0 N_k (\bar{y}_k - \mu_0)^2}{\kappa_k} \right)$.
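
The M-step updates can be sketched in code using the standard conjugate Normal-Gamma and Dirichlet updates (as in chapter 10 of Bishop's book, adapted to the univariate parameterization used here); the hyperparameter names `alpha0, mu0, kappa0, a0, b0` are assumptions matching the priors above, and the algebra should be checked against the text:

```python
import numpy as np

def m_step(N_k, ybar_k, S_k, alpha0=1.0, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Variational M step: update the Dirichlet parameters of the weights
    and the Normal-Gamma parameters of each (mu_k, tau_k).

    Standard conjugate updates; hyperparameter names are assumptions."""
    alpha_k = alpha0 + N_k                     # Dirichlet
    kappa_k = kappa0 + N_k                     # scales the precision of mu_k
    m_k = (kappa0 * mu0 + N_k * ybar_k) / kappa_k
    a_k = a0 + 0.5 * N_k                       # Gamma shape
    b_k = b0 + 0.5 * (N_k * S_k
                      + kappa0 * N_k * (ybar_k - mu0) ** 2 / kappa_k)
    return alpha_k, kappa_k, m_k, a_k, b_k

# degenerate check: a component with no data assigned (N_k = 0) keeps its prior
alpha_k, kappa_k, m_k, a_k, b_k = m_step(np.array([0.0]), np.array([0.0]),
                                         np.array([0.0]))
print(alpha_k, kappa_k, m_k, a_k, b_k)
```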

• Choice of K:

TODO

• References:
• book "Pattern Recognition and Machine Learning" by Christopher Bishop