User:Timothee Flutre/Notebook/Postdoc/2012/08/16

From OpenWetWare

(Difference between revisions)
Jump to: navigation, search
(Variational Bayes approach for the mixture of Normals: add part of updates for q_z)
(Variational Bayes approach for the mixture of Normals: improve notation)
Line 23: Line 23:
* '''Observed likelihood''':  observations assumed exchangeable (independent and identically distributed given the parameters)
* '''Observed likelihood''':  observations assumed exchangeable (independent and identically distributed given the parameters)
-
<math>p(\mathbf{y} | \Theta, K) = \prod_{n=1}^N p(y_n|\Theta,K) = \prod_{n=1}^N \sum_{k=1}^K w_k Normal(y_n;\mu_k,\tau_k^{-1})</math>
+
<math>p(\mathbf{y} | \Theta, K) = \prod_{n=1}^N p(y_n|\Theta,K) = \prod_{n=1}^N \sum_{k=1}^K w_k \; \mathcal{N}(y_n;\mu_k,\tau_k^{-1})</math>
* '''Latent variables''': N hidden variables, <math>z_1,\ldots,z_N</math>, each being a vector of length K with a single 1 indicating the component to which the <math>n^{th}</math> observation belongs, and K-1 zeroes.
* '''Latent variables''': N hidden variables, <math>z_1,\ldots,z_N</math>, each being a vector of length K with a single 1 indicating the component to which the <math>n^{th}</math> observation belongs, and K-1 zeroes.
Line 29: Line 29:
* '''Augmented likelihood''':
* '''Augmented likelihood''':
-
<math>p(\mathbf{y},\mathbf{z}|\Theta,K) = \prod_{n=1}^N p(y_n,z_n|\Theta,K) = \prod_{n=1}^N p(z_n|\Theta,K) p(y_n|z_n,\Theta,K) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}} Normal(y_n;\mu_k,\tau_k^{-1})^{z_{nk}}</math>
+
<math>p(\mathbf{y},\mathbf{z}|\Theta,K) = \prod_{n=1}^N p(y_n,z_n|\Theta,K) = \prod_{n=1}^N p(z_n|\Theta,K) p(y_n|z_n,\Theta,K) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}} \; \mathcal{N}(y_n;\mu_k,\tau_k^{-1})^{z_{nk}}</math>
* '''Maximum-likelihood estimation''': integrate out the latent variables
* '''Maximum-likelihood estimation''': integrate out the latent variables
Line 41: Line 41:
* '''Priors''': conjuguate
* '''Priors''': conjuguate
-
** <math>\forall k \; \mu_k | \tau_k \sim Normal(\mu_0,(\tau_0 \tau_k)^{-1})</math> and <math>\forall k \; \tau_k \sim Gamma(\alpha,\beta)</math>
+
** <math>\forall k \; \mu_k | \tau_k \sim \mathcal{N}(\mu_0,(\tau_0 \tau_k)^{-1})</math> and <math>\forall k \; \tau_k \sim \mathcal{G}a(\alpha,\beta)</math>
-
** <math>\forall n \; z_n \sim Multinomial_K(1,\mathbf{w})</math> and <math>\mathbf{w} \sim Dirichlet(\gamma)</math>
+
** <math>\forall n \; z_n \sim \mathcal{M}ult_K(1,\mathbf{w})</math> and <math>\mathbf{w} \sim \mathcal{D}ir(\gamma)</math>
Line 55: Line 55:
We can then use the concavity of the logarithm ([http://en.wikipedia.org/wiki/Jensen%27s_inequality Jensen's inequality]) to derive a lower bound of the marginal log-likelihood:
We can then use the concavity of the logarithm ([http://en.wikipedia.org/wiki/Jensen%27s_inequality Jensen's inequality]) to derive a lower bound of the marginal log-likelihood:
-
<math>\mathrm{ln} \, p(\mathbf{y} | K) \ge \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} + C_{\mathbf{z}, \Theta} = \mathcal{F}(q)</math>
+
<math>\mathrm{ln} \, p(\mathbf{y} | K) \ge \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} + C_{\mathbf{z}, \Theta} = \mathcal{F}_K(q)</math>
-
Let's call this lower bound <math>\mathcal{F}(q)</math> as it is a [http://en.wikipedia.org/wiki/Functional_%28mathematics%29 functional], ie. a ''function of functions''. To gain some intuition about the impact of introducing <math>q</math>, let's expand <math>\mathcal{F}</math>:
+
Let's call this lower bound <math>\mathcal{F}_K(q)</math> as it is a [http://en.wikipedia.org/wiki/Functional_%28mathematics%29 functional], ie. a ''function of functions''. To gain some intuition about the impact of introducing <math>q</math>, let's expand <math>\mathcal{F}_K</math>:
-
<math>\mathcal{F}(q) = \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) + \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} + C_{\mathbf{z}, \Theta}</math>
+
<math>\mathcal{F}_K(q) = \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) + \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} + C_{\mathbf{z}, \Theta}</math>
-
<math>\mathcal{F}(q) = \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) - D_{KL}(q || p)</math>
+
<math>\mathcal{F}_K(q) = \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) - D_{KL}(q || p)</math>
-
From this, it is clear that <math>\mathcal{F}</math> (ie. a lower-bound of the marginal log-likelihood) is the conditional log-likelihood minus the [http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence Kullback-Leibler divergence] between the variational distribution <math>q</math> and the joint posterior of latent variables and parameters.
+
From this, it is clear that <math>\mathcal{F}_K</math> (ie. a lower-bound of the marginal log-likelihood) is the conditional log-likelihood minus the [http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence Kullback-Leibler divergence] between the variational distribution <math>q</math> and the joint posterior of latent variables and parameters.
In practice, we have to make the following crucial assumption of independence on <math>q</math> in order for the calculations to be analytically tractable:
In practice, we have to make the following crucial assumption of independence on <math>q</math> in order for the calculations to be analytically tractable:
Line 71: Line 71:
This means that <math>q_\mathbf{z} q_\Theta</math> approximates the joint posterior, and therefore the lower-bound will be tight if and only if this approximation is exact and the KL divergence is zero.
This means that <math>q_\mathbf{z} q_\Theta</math> approximates the joint posterior, and therefore the lower-bound will be tight if and only if this approximation is exact and the KL divergence is zero.
-
As we ultimately aim at inferring the parameters and latent variables that maximize the marginal log-likelihood, we will use the [http://en.wikipedia.org/wiki/Calculus_of_variations calculus of variations] to find the functions <math>q_\mathbf{z}</math> and <math>q_\Theta</math> that maximize the functional <math>\mathcal{F}</math>.
+
As we ultimately aim at inferring the parameters and latent variables that maximize the marginal log-likelihood, we will use the [http://en.wikipedia.org/wiki/Calculus_of_variations calculus of variations] to find the functions <math>q_\mathbf{z}</math> and <math>q_\Theta</math> that maximize the functional <math>\mathcal{F}_K</math>.
-
<math>\mathcal{F}(q_\mathbf{z}, q_\Theta) = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; q(\mathbf{z}) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z} | \Theta, K)}{q(\mathbf{z})} + \mathrm{ln} \, \frac{p(\Theta | K)}{q(\Theta)} \right) + C_{\mathbf{z}} + C_{\Theta}</math>
+
<math>\mathcal{F}_K(q_\mathbf{z}, q_\Theta) = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; q_\mathbf{z}(\mathbf{z}) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z} | \Theta, K)}{q_\mathbf{z}(\mathbf{z})} + \mathrm{ln} \, \frac{p(\Theta | K)}{q_\Theta(\Theta)} \right) + C_{\mathbf{z}} + C_{\Theta}</math>
This naturally leads to a procedure very similar to the EM algorithm where, at the E step, we calculate the expectations of the parameters with respect to the variational distributions <math>q_\mathbf{z}</math> and <math>q_\Theta</math>, and, at the M step, we recompute the variational distributions over the parameters.
This naturally leads to a procedure very similar to the EM algorithm where, at the E step, we calculate the expectations of the parameters with respect to the variational distributions <math>q_\mathbf{z}</math> and <math>q_\Theta</math>, and, at the M step, we recompute the variational distributions over the parameters.
Line 80: Line 80:
* '''Updates for <math>q_\mathbf{z}</math>''':
* '''Updates for <math>q_\mathbf{z}</math>''':
-
We start by writing the functional derivative of <math>\mathcal{F}</math> with respect to <math>q_{\mathbf{z}}</math>:
+
We start by writing the functional derivative of <math>\mathcal{F}_K</math> with respect to <math>q_{\mathbf{z}}</math>:
-
<math>\frac{\partial \mathcal{F}}{\partial q_{\mathbf{z}}} = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left[ \frac{\partial}{\partial q_{\mathbf{z}}} \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; \left( q_{\mathbf{z}}(\mathbf{z}) \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) - q_{\mathbf{z}}(\mathbf{z}) \mathrm{ln} \, q_{\mathbf{z}}(\mathbf{z}) \right) \right) \right] + C_{\mathbf{z}}</math>
+
<math>\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left[ \frac{\partial}{\partial q_{\mathbf{z}}} \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; \left( q_{\mathbf{z}}(\mathbf{z}) \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) - q_{\mathbf{z}}(\mathbf{z}) \mathrm{ln} \, q_{\mathbf{z}}(\mathbf{z}) \right) \right) \right] + C_{\mathbf{z}}</math>
-
<math>\frac{\partial \mathcal{F}}{\partial q_{\mathbf{z}}} = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left[ \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) - \mathrm{ln} \, q_{\mathbf{z}}(\mathbf{z}) - 1 \right] + C_{\mathbf{z}}</math>
+
<math>\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left[ \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) - \mathrm{ln} \, q_{\mathbf{z}}(\mathbf{z}) - 1 \right] + C_{\mathbf{z}}</math>
-
Then we set this functional derivative to zero. We also make use of a frequent assumption, namely that the variational distribution fully factorizes over each individual latent variables (mean-field assumption):
+
Then we set this functional derivative to zero. We also make use of a frequent assumption, namely that the variational distribution fully factorizes over each individual latent variables ([http://en.wikipedia.org/wiki/Mean_field_theory mean-field assumption]):
-
<math>\frac{\partial \mathcal{F}}{\partial q_{\mathbf{z}}} \bigg|_{q_{\mathbf{z}}^{(t+1)}} = 0 \Longleftrightarrow \forall \, n \; \mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \left( \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) \right) - 1 + C_{z_n}</math>
+
<math>\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} \bigg|_{q_{\mathbf{z}}^{(t+1)}} = 0 \Longleftrightarrow \forall \, n \; \mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \left( \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) \right) - 1 + C_{z_n}</math>
TODO
TODO
Line 94: Line 94:
* '''Updates for <math>q_\Theta</math>''':
* '''Updates for <math>q_\Theta</math>''':
 +
 +
TODO
 +
 +
 +
* '''Choice of K''':
TODO
TODO

Revision as of 18:47, 23 May 2013

Project name Main project page
Previous entry      Next entry

Variational Bayes approach for the mixture of Normals

  • Motivation: I have described on another page the basics of mixture models and the EM algorithm in a frequentist context. It is worth reading before continuing. Here I am interested in the Bayesian approach as well as in a specific variational method (nicknamed "Variational Bayes").


  • Data: N univariate observations, y_1, \ldots, y_N, gathered into the vector \mathbf{y}
  • Model: mixture of K Normal distributions
  • Parameters: K mixture weights (wk), K means (μk) and K precisions (τk), one per mixture component

\Theta = \{w_1,\ldots,w_K,\mu_1,\ldots,\mu_K,\tau_1,\ldots,\tau_K\}

  • Constraints: \sum_{k=1}^K w_k = 1 and \forall k \; w_k > 0.
  • Observed likelihood: observations assumed exchangeable (independent and identically distributed given the parameters)

p(\mathbf{y} | \Theta, K) = \prod_{n=1}^N p(y_n|\Theta,K) = \prod_{n=1}^N \sum_{k=1}^K w_k \; \mathcal{N}(y_n;\mu_k,\tau_k^{-1})

  • Latent variables: N hidden variables, z_1,\ldots,z_N, each being a vector of length K with a single 1 indicating the component to which the nth observation belongs, and K-1 zeroes.
  • Augmented likelihood:

p(\mathbf{y},\mathbf{z}|\Theta,K) = \prod_{n=1}^N p(y_n,z_n|\Theta,K) = \prod_{n=1}^N p(z_n|\Theta,K) p(y_n|z_n,\Theta,K) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}} \; \mathcal{N}(y_n;\mu_k,\tau_k^{-1})^{z_{nk}}

  • Maximum-likelihood estimation: integrate out the latent variables

\mathrm{ln} \, p(\mathbf{y} | \Theta, K) = \sum_{n=1}^N \mathrm{ln} \, \int_{z_n} \mathrm{d}{z_n} \; p(y_n, z_n | \Theta, K)

The latent variables induce dependencies between all the parameters of the model. This makes it difficult to find the parameters that maximize the likelihood. An elegant solution is to introduce a variational distribution of parameters and latent variables, which leads to a re-formulation of the classical EM algorithm. But let's show it directly in the Bayesian paradigm.

  • Priors: conjuguate
    • \forall k \; \mu_k | \tau_k \sim \mathcal{N}(\mu_0,(\tau_0 \tau_k)^{-1}) and \forall k \; \tau_k \sim \mathcal{G}a(\alpha,\beta)
    • \forall n \; z_n \sim \mathcal{M}ult_K(1,\mathbf{w}) and \mathbf{w} \sim \mathcal{D}ir(\gamma)


  • Variational Bayes: let's focus here on calculating the marginal log-likelihood of our data set in order to perform model comparison:

\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; p(\mathbf{y}, \mathbf{z}, \Theta | K)

\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \left( \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} \right) + C_{\mathbf{z}, \Theta}

The constant C is here to remind us that q has the constraint of being a distribution, ie. of summing to 1, which can be enforced by a Lagrange multiplier.

We can then use the concavity of the logarithm (Jensen's inequality) to derive a lower bound of the marginal log-likelihood:

\mathrm{ln} \, p(\mathbf{y} | K) \ge \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} + C_{\mathbf{z}, \Theta} = \mathcal{F}_K(q)

Let's call this lower bound \mathcal{F}_K(q) as it is a functional, ie. a function of functions. To gain some intuition about the impact of introducing q, let's expand \mathcal{F}_K:

\mathcal{F}_K(q) = \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) + \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} + C_{\mathbf{z}, \Theta}

\mathcal{F}_K(q) = \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) - D_{KL}(q || p)

From this, it is clear that \mathcal{F}_K (ie. a lower-bound of the marginal log-likelihood) is the conditional log-likelihood minus the Kullback-Leibler divergence between the variational distribution q and the joint posterior of latent variables and parameters.

In practice, we have to make the following crucial assumption of independence on q in order for the calculations to be analytically tractable:

q(\mathbf{z}, \Theta) = q_{\mathbf{z}}(\mathbf{z}) q_\Theta(\Theta)

This means that q_\mathbf{z} q_\Theta approximates the joint posterior, and therefore the lower-bound will be tight if and only if this approximation is exact and the KL divergence is zero.

As we ultimately aim at inferring the parameters and latent variables that maximize the marginal log-likelihood, we will use the calculus of variations to find the functions q_\mathbf{z} and qΘ that maximize the functional \mathcal{F}_K.

\mathcal{F}_K(q_\mathbf{z}, q_\Theta) = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; q_\mathbf{z}(\mathbf{z}) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z} | \Theta, K)}{q_\mathbf{z}(\mathbf{z})} + \mathrm{ln} \, \frac{p(\Theta | K)}{q_\Theta(\Theta)} \right) + C_{\mathbf{z}} + C_{\Theta}

This naturally leads to a procedure very similar to the EM algorithm where, at the E step, we calculate the expectations of the parameters with respect to the variational distributions q_\mathbf{z} and qΘ, and, at the M step, we recompute the variational distributions over the parameters.


  • Updates for q_\mathbf{z}:

We start by writing the functional derivative of \mathcal{F}_K with respect to q_{\mathbf{z}}:

\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left[ \frac{\partial}{\partial q_{\mathbf{z}}} \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; \left( q_{\mathbf{z}}(\mathbf{z}) \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) - q_{\mathbf{z}}(\mathbf{z}) \mathrm{ln} \, q_{\mathbf{z}}(\mathbf{z}) \right) \right) \right] + C_{\mathbf{z}}

\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} = \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \left[ \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) - \mathrm{ln} \, q_{\mathbf{z}}(\mathbf{z}) - 1 \right] + C_{\mathbf{z}}

Then we set this functional derivative to zero. We also make use of a frequent assumption, namely that the variational distribution fully factorizes over each individual latent variables (mean-field assumption):

\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} \bigg|_{q_{\mathbf{z}}^{(t+1)}} = 0 \Longleftrightarrow \forall \, n \; \mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \left( \int_\Theta \, \mathrm{d}\Theta \; q_\Theta(\Theta) \; \mathrm{ln} \, p(\mathbf{y},\mathbf{z}|\Theta,K) \right) - 1 + C_{z_n}

TODO


  • Updates for qΘ:

TODO


  • Choice of K:

TODO


Personal tools