User:Timothee Flutre/Notebook/Postdoc/2012/08/16

From OpenWetWare

Current revision (20:28, 5 August 2013)
Variational Bayes approach for the mixture of Normals

  • Motivation: I have described on another page the basics of mixture models and the EM algorithm in a frequentist context. It is worth reading before continuing. Here I am interested in the Bayesian approach as well as in a specific variational method (nicknamed "Variational Bayes").


  • Data: N univariate observations, y_1, \ldots, y_N, gathered into the vector \mathbf{y}
  • Model: mixture of K Normal distributions
  • Parameters: K mixture weights (w_k), K means (\mu_k) and K precisions (\tau_k), one per mixture component

\Theta = \{w_1,\ldots,w_K,\mu_1,\ldots,\mu_K,\tau_1,\ldots,\tau_K\}

  • Constraints: \sum_{k=1}^K w_k = 1 and \forall k \; w_k > 0.
  • Observed likelihood: observations assumed exchangeable (independent and identically distributed given the parameters)

p(\mathbf{y} | \Theta, K) = \prod_{n=1}^N p(y_n|\Theta,K) = \prod_{n=1}^N \sum_{k=1}^K w_k \; \mathcal{N}(y_n;\mu_k,\tau_k^{-1})
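To make this concrete, here is a minimal simulation in Python (hypothetical values for K, w, \mu_k and \tau_k) that draws data from the mixture and evaluates the observed log-likelihood with a numerically stable log-sum-exp:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# hypothetical example with K=2 components
w = np.array([0.3, 0.7])    # mixture weights
mu = np.array([-2.0, 3.0])  # component means
tau = np.array([1.0, 4.0])  # component precisions

# draw N observations from the mixture
N = 500
z = rng.choice(2, size=N, p=w)                # latent component labels
y = rng.normal(mu[z], 1.0 / np.sqrt(tau[z]))  # y_n | z_n ~ Normal(mu_k, tau_k^-1)

# observed log-likelihood: sum_n ln sum_k w_k Normal(y_n; mu_k, tau_k^-1)
log_comp = norm.logpdf(y[:, None], loc=mu, scale=1.0 / np.sqrt(tau))
loglik = logsumexp(np.log(w) + log_comp, axis=1).sum()
```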

  • Latent variables: N hidden variables, z_1,\ldots,z_N, each being a vector of length K with a single 1 indicating the component to which the n^{th} observation belongs, and K-1 zeroes.

p(\mathbf{z}|\mathbf{w},K) = \prod_{n=1}^N p(z_n|\mathbf{w},K) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}}

  • Augmented likelihood:

p(\mathbf{y},\mathbf{z}|\Theta,K) = \prod_{n=1}^N p(y_n,z_n|\Theta,K) = \prod_{n=1}^N p(z_n|\Theta,K) p(y_n|z_n,\Theta,K) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}} \; \mathcal{N}(y_n;\mu_k,\tau_k^{-1})^{z_{nk}}

  • Maximum-likelihood estimation: integrate out the latent variables

\mathrm{ln} \, p(\mathbf{y} | \Theta, K) = \sum_{n=1}^N \mathrm{ln} \, \int_{z_n} p(y_n, z_n | \Theta, K) \, \mathrm{d}{z_n}

The latent variables induce dependencies between all the parameters of the model. This makes it difficult to find the parameters that maximize the likelihood. An elegant solution is to introduce a variational distribution of parameters and latent variables, which leads to a re-formulation of the classical EM algorithm. But let's show it directly in the Bayesian paradigm.

  • Priors: conjugate
    • p(\Theta | K) = p(\mathbf{w} | K) \prod_k p(\tau_k) p(\mu_k | \tau_k)
    • \forall k \; \tau_k \sim \mathcal{G}a(\alpha,\beta) and \forall k \; \mu_k | \tau_k \sim \mathcal{N}(\mu_0,(\tau_0 \tau_k)^{-1})
    • \forall n \; z_n \sim \mathcal{M}ult_K(1,\mathbf{w}) and \mathbf{w} \sim \mathcal{D}ir(\gamma_0)


  • Variational Bayes: let's focus here on calculating the marginal log-likelihood of our data set in order to perform model comparison:

\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \int_\mathbf{z} \int_\Theta \; p(\mathbf{y}, \mathbf{z}, \Theta | K) \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta

We can now introduce a distribution q_{\mathbf{z}, \Theta}:

\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \left( \int_\mathbf{z} \int_\Theta \; q_{\mathbf{z}, \Theta}(\mathbf{z}, \Theta) \; \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q_{\mathbf{z}, \Theta}(\mathbf{z}, \Theta)} \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \right) + C_{\mathbf{z}, \Theta}

The constant C_{\mathbf{z}, \Theta} is here to remind us that q_{\mathbf{z}, \Theta} is constrained to be a distribution, i.e. to sum to 1, which can be enforced by a Lagrange multiplier.

We can then use the concavity of the logarithm (Jensen's inequality) to derive a lower bound of the marginal log-likelihood:

\mathrm{ln} \, p(\mathbf{y} | K) \ge \int_\mathbf{z} \int_\Theta \; q_{\mathbf{z}, \Theta}(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q_{\mathbf{z}, \Theta}(\mathbf{z}, \Theta)} \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta + C_{\mathbf{z}, \Theta} = \mathcal{F}_K(q)

Let's call this lower bound \mathcal{F}_K(q) as it is a functional, i.e. a function of functions. To gain some intuition about the impact of introducing q, let's expand \mathcal{F}_K:

\mathcal{F}_K(q) = \int_\mathbf{z} \int_\Theta \; q_{\mathbf{z}, \Theta}(\mathbf{z}, \Theta) \; \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; + \; \int_\mathbf{z} \int_\Theta \; q_{\mathbf{z}, \Theta}(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{z}, \Theta | K)}{q_{\mathbf{z}, \Theta}(\mathbf{z}, \Theta)} \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; + \; C_{\mathbf{z}, \Theta}

\mathcal{F}_K(q) = \mathbb{E}_q[\mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K)] - D_{KL}(q_{\mathbf{z}, \Theta} \, || \, p(\mathbf{z}, \Theta | K))

From this, it is clear that \mathcal{F}_K (i.e. a lower bound of the marginal log-likelihood) is the expected conditional log-likelihood minus the Kullback-Leibler divergence between the variational distribution q and the joint prior of latent variables and parameters. Equivalently, the gap \mathrm{ln} \, p(\mathbf{y} | K) - \mathcal{F}_K(q) is D_{KL}(q_{\mathbf{z}, \Theta} \, || \, p(\mathbf{z}, \Theta | \mathbf{y}, K)), so maximizing \mathcal{F}_K drives q toward the joint posterior. As a side note, minimizing D_{KL}(p || q) instead is what the expectation-propagation technique does.

In practice, we have to make the following crucial assumption of independence on q_{\mathbf{z}, \Theta} in order for the calculations to be analytically tractable:

q_{\mathbf{z}, \Theta}(\mathbf{z}, \Theta) = q_{\mathbf{z}}(\mathbf{z}) q_\Theta(\Theta)

This means that q_\mathbf{z} q_\Theta approximates the joint posterior, and therefore the lower bound will be tight if and only if this approximation is exact, i.e. when the KL divergence to the posterior is zero.

As we ultimately aim at inferring the parameters and latent variables that maximize the marginal log-likelihood, we will use the calculus of variations to find the functions q_\mathbf{z} and q_\Theta that maximize the functional \mathcal{F}_K.

\mathcal{F}_K(q_\mathbf{z}, q_\Theta) = \int_\Theta \; q_\Theta(\Theta) \; \left( \int_\mathbf{z} \; q_\mathbf{z}(\mathbf{z}) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z} | \Theta, K)}{q_\mathbf{z}(\mathbf{z})} \, \mathrm{d}\mathbf{z} + \mathrm{ln} \, \frac{p(\Theta | K)}{q_\Theta(\Theta)} \right) \, \mathrm{d}\Theta \; + \; C_{\mathbf{z}} \; + \; C_{\Theta}

This naturally leads to a procedure very similar to the EM algorithm: at the E step, we update q_\mathbf{z} using expectations taken under q_\Theta, and, at the M step, we update q_\Theta using expectations taken under q_\mathbf{z}.


  • Updates for q_\mathbf{z}:

We start by writing the functional derivative of \mathcal{F}_K with respect to q_{\mathbf{z}}:

\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} = \int_\Theta \; q_\Theta(\Theta) \; \frac{\partial}{\partial q_{\mathbf{z}}} \left( \int_\mathbf{z} \; \left( q_{\mathbf{z}}(\mathbf{z}) \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) - q_{\mathbf{z}}(\mathbf{z}) \mathrm{ln} \, q_{\mathbf{z}}(\mathbf{z}) \right) \, \mathrm{d}\mathbf{z} \right) \, \mathrm{d}\Theta \; + \; C_{\mathbf{z}}

\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} = \int_\Theta \; q_\Theta(\Theta) \; \left( \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) - \mathrm{ln} \, q_{\mathbf{z}}(\mathbf{z}) - 1 \right) \, \mathrm{d}\Theta \; + \; C_{\mathbf{z}}

Then we set this functional derivative to zero. We also make use of a frequent assumption, namely that the variational distribution fully factorizes over each individual latent variable (mean-field assumption):

\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} \bigg|_{q_{\mathbf{z}}^{(t+1)}} = 0 \Longleftrightarrow \forall \, n \; \mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \int_\Theta \; q_\Theta(\Theta) \; \mathrm{ln} \, p(y_n,z_n|\Theta,K) \, \mathrm{d}\Theta \; + \; C_{z_n}

Recognizing the expectation and factorizing q_\Theta(\Theta) into q_\mathbf{w}(\mathbf{w})q_\mathbf{\mu,\tau}(\mathbf{\mu,\tau}), we get:

\mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \mathbb{E}_\mathbf{w}[\mathrm{ln} \, p(z_n|\mathbf{w},K)] + \mathbb{E}_{\mathbf{\mu,\tau}}[\mathrm{ln} \, p(y_n|z_n,\mathbf{\mu},\mathbf{\tau},K)] \; + \; \text{constant}

\mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \sum_{k=1}^K ( z_{nk} \; \mathrm{ln} \, \rho_{nk} ) \; + \; \text{constant} where \mathrm{ln} \, \rho_{nk} = \mathbb{E}[\mathrm{ln} \, w_k] + \frac{1}{2} \mathbb{E}[\mathrm{ln} \, \tau_k] - \frac{1}{2} \mathrm{ln} \, 2\pi - \frac{1}{2} \mathbb{E}[\tau_k (y_n-\mu_k)^2]

Taking the exponential: q_{z_n}^{(t+1)}(z_n) \propto \prod_k \rho_{nk}^{z_{nk}}

As this should be a distribution, it should sum to one, and therefore:

q_{z_n}^{(t+1)}(z_n) = \prod_k r_{nk}^{z_{nk}} where r_{nk} = \frac{\rho_{nk}}{\sum_{k'=1}^K \rho_{nk'}} ("r" stands for "responsibility")

Interestingly, even though we haven't specified anything yet about q_{z_n}, we can see that it is of the same form as the prior on z_n, a Multinomial distribution.
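The expectations appearing in \mathrm{ln} \, \rho_{nk} have closed forms once q_\mathbf{w} and q(\mu_k, \tau_k) take the Dirichlet and Normal-Gamma forms derived below (standard results; \psi denotes the digamma function):

```latex
\mathbb{E}[\mathrm{ln} \, w_k] = \psi(\gamma_k) - \psi\left(\sum_{k'=1}^K \gamma_{k'}\right)

\mathbb{E}[\mathrm{ln} \, \tau_k] = \psi(\tilde{\alpha}_k) - \mathrm{ln} \, \tilde{\beta}_k

\mathbb{E}[\tau_k (y_n - \mu_k)^2] = \tilde{\tau}_k^{-1} + \frac{\tilde{\alpha}_k}{\tilde{\beta}_k} (y_n - \tilde{\mu}_k)^2
```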


  • Updates for q_\Theta:

We start by writing the functional derivative of \mathcal{F}_K with respect to q_\Theta:

\frac{\partial \mathcal{F}_K}{\partial q_\Theta} = \frac{\partial}{\partial q_\Theta} \left( \int_\Theta \; q_\Theta(\Theta) \; \left( \int_\mathbf{z} \; q_\mathbf{z}(\mathbf{z}) \; \mathrm{ln} \, p(\mathbf{y}, \mathbf{z} | \Theta, K) \, \mathrm{d}\mathbf{z} + \mathrm{ln} \, \frac{p(\Theta | K)}{q_\Theta(\Theta)} \right) \, \mathrm{d}\Theta \right) \; + \; C_{\Theta}

\frac{\partial \mathcal{F}_K}{\partial q_\Theta} = \int_\mathbf{z} \; q_\mathbf{z}(\mathbf{z}) \; \left( \mathrm{ln} \, p(\mathbf{z} | \Theta, K) + \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) \right) \, \mathrm{d}\mathbf{z} \; + \; \mathrm{ln} \, p(\Theta | K) - \mathrm{ln} \, q_\Theta(\Theta) \; - 1 \; + \; C_{\Theta}

Then, when setting this functional derivative to zero, we obtain:

\mathrm{ln} \, q_\Theta(\Theta)^{(t+1)} = \mathbb{E}_\mathbf{z}[\mathrm{ln} \, p(\mathbf{z} | \mathbf{w}, K)] \; + \; \sum_n \sum_k \mathbb{E}[z_{nk}] \mathrm{ln} \, p(y_n | \mu_k, \tau_k) \; + \; \mathrm{ln} \, p(\mathbf{w} | K) \; + \; \sum_k \mathrm{ln} \, p(\mu_k, \tau_k | K) \; + \; \text{constant}

Note how no term involves both \mathbf{w} and \mu_k,\tau_k. This naturally implies the factorization q_\Theta(\Theta) = q_\mathbf{w}(\mathbf{w})q_\mathbf{\mu,\tau}(\mathbf{\mu,\tau}). And we can also notice the further factorization q_\mathbf{\mu,\tau}(\mathbf{\mu,\tau}) = \prod_k q(\mu_k, \tau_k).

Starting with \mathbf{w}:

\mathrm{ln} \, q_\mathbf{w}^{(t+1)}(\mathbf{w}) = \sum_n \sum_k \mathbb{E}[z_{nk}] \, \mathrm{ln} \, w_k \; + \; (\gamma_0 - 1) \sum_k \mathrm{ln} \, w_k \; + \; \text{constant}

Recognizing \mathbb{E}[z_{nk}] = r_{nk} and taking the exponential, we get another Dirichlet distribution:

q_\mathbf{w}^{(t+1)}(\mathbf{w}) \propto \prod_k w_k^{\gamma_0-1+\sum_n r_{nk}}

that is q_\mathbf{w}^{(t+1)} \sim \mathcal{D}ir(\gamma^{(t+1)}) with \gamma_k^{(t+1)} = \gamma_0 + \sum_n r_{nk} (the exponent \gamma_0 - 1 + \sum_n r_{nk} above equals \gamma_k^{(t+1)} - 1)
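In code, the whole q_\mathbf{z} update reduces to a few lines. A minimal sketch (hypothetical argument names for the variational hyperparameters; the expectations are the standard Dirichlet and Normal-Gamma results, with the Gamma in shape-rate form):

```python
import numpy as np
from scipy.special import digamma, logsumexp

def responsibilities(y, gamma, mu_t, tau_t, alpha_t, beta_t):
    """E step: compute r_nk (N x K) from the current variational
    hyperparameters (Dirichlet gamma; Normal-Gamma mu_t, tau_t, alpha_t, beta_t)."""
    E_ln_w = digamma(gamma) - digamma(gamma.sum())  # E[ln w_k]
    E_ln_tau = digamma(alpha_t) - np.log(beta_t)    # E[ln tau_k]
    # E[tau_k (y_n - mu_k)^2] = 1/tau_t_k + (alpha_t_k / beta_t_k) (y_n - mu_t_k)^2
    E_quad = 1.0 / tau_t + (alpha_t / beta_t) * (y[:, None] - mu_t) ** 2
    ln_rho = E_ln_w + 0.5 * E_ln_tau - 0.5 * np.log(2 * np.pi) - 0.5 * E_quad
    # normalize rho_nk over k in log space to get the responsibilities r_nk
    return np.exp(ln_rho - logsumexp(ln_rho, axis=1, keepdims=True))
```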

As for the other parameters, we recognize a Normal-Gamma distribution:

q(\mu_k,\tau_k) = q(\tau_k) q(\mu_k | \tau_k) = \mathcal{N}\mathcal{G}a(\tilde{\mu}_k, \tilde{\tau}_k^{-1}, \tilde{\alpha}_k, \tilde{\beta}_k)

It's easier to first define three statistics of the data with respect to the responsibilities:

N_k = \sum_n r_{nk}

\bar{y}_k = \frac{1}{N_k} \sum_n r_{nk} y_n

S_k = \frac{1}{N_k} \sum_n r_{nk} (y_n - \bar{y}_k)^2.
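These three statistics are just responsibility-weighted counts, means and variances. A minimal numpy sketch (hypothetical function name):

```python
import numpy as np

def weighted_stats(y, r):
    """Responsibility-weighted count, mean and variance for each component."""
    N_k = r.sum(axis=0)                                       # N_k = sum_n r_nk
    ybar_k = (r * y[:, None]).sum(axis=0) / N_k               # ybar_k
    S_k = (r * (y[:, None] - ybar_k) ** 2).sum(axis=0) / N_k  # S_k
    return N_k, ybar_k, S_k
```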

This allows us to concisely write the standard conjugate Normal-Gamma updates (shape-rate parameterization of the Gamma):

\tilde{\tau}_k = \tau_0 + N_k

\tilde{\mu}_k = \frac{1}{\tilde{\tau}_k} (\tau_0 \mu_0 + N_k \bar{y}_k)

\tilde{\alpha}_k = \alpha + \frac{N_k}{2}

\tilde{\beta}_k = \beta + \frac{1}{2} \left( N_k S_k + \frac{\tau_0 N_k}{\tau_0 + N_k} (\bar{y}_k - \mu_0)^2 \right)
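Assuming the standard conjugate Dirichlet and Normal-Gamma results (Gamma in shape-rate form), the variational M step can be sketched as follows (hypothetical function name; hyperparameter names follow the priors above):

```python
import numpy as np

def m_step(N_k, ybar_k, S_k, mu0, tau0, alpha0, beta0, gamma0):
    """Variational M step for the mixture of Normals, assuming standard
    conjugate Dirichlet / Normal-Gamma updates (Gamma in shape-rate form).
    Component-wise inputs are numpy arrays of length K."""
    gamma_t = gamma0 + N_k                      # Dirichlet pseudo-counts
    tau_t = tau0 + N_k                          # precision scaling of q(mu_k | tau_k)
    mu_t = (tau0 * mu0 + N_k * ybar_k) / tau_t  # mean of q(mu_k | tau_k)
    alpha_t = alpha0 + 0.5 * N_k                # Gamma shape
    beta_t = beta0 + 0.5 * (N_k * S_k           # Gamma rate
                            + tau0 * N_k * (ybar_k - mu0) ** 2 / (tau0 + N_k))
    return gamma_t, mu_t, tau_t, alpha_t, beta_t
```

Iterating the E step (responsibilities) and this M step until the lower bound \mathcal{F}_K stabilizes gives the full variational Bayes algorithm.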


  • Choice of K:

TODO


  • References:
    • the book "Pattern Recognition and Machine Learning" by Christopher Bishop (chapter 10 treats variational mixtures of Gaussians)

