User:Timothee Flutre/Notebook/Postdoc/2012/08/16
From OpenWetWare
m (→Variational Bayes approach for the mixture of Normals: fix error prior \mu_k + add link precision) |
(→Variational Bayes approach for the mixture of Normals: add principle of Variational Bayes) |
||
Line 18: | Line 18: | ||
* '''Observed likelihood''': <math>p(\mathbf{y} | \Theta, K) = \prod_{n=1}^N p(y_n|\Theta,K) = \prod_{n=1}^N \sum_{k=1}^K w_k Normal(y_n;\mu_k,\tau_k^{-1})</math> | * '''Observed likelihood''': <math>p(\mathbf{y} | \Theta, K) = \prod_{n=1}^N p(y_n|\Theta,K) = \prod_{n=1}^N \sum_{k=1}^K w_k Normal(y_n;\mu_k,\tau_k^{-1})</math> | ||
+ | |||
+ | |||
+ | * '''Maximizing the observed log-likelihood''': as shown [http://openwetware.org/wiki/User:Timothee_Flutre/Notebook/Postdoc/2011/12/14 here], maximizing the likelihood of a mixture model is like doing a weighted likelihood maximization. However, these weights depend on the parameters we want to estimate! That's why we now switch to the missing-data formulation of the mixture model. | ||
Line 26: | Line 29: | ||
- | * '''Priors''': we choose conjugate | + | * '''Priors''': in the Bayesian paradigm, parameters and latent variables are random variables for which we want to infer the posterior distribution. To make the calculations tractable, we give them prior distributions that are conjugate to the likelihood. |
** for the parameters: <math>\forall k \; \mu_k | \tau_k \sim Normal(\mu_0,(\tau_0 \tau_k)^{-1})</math> and <math>\forall k \; \tau_k \sim Gamma(\alpha,\beta)</math> | ** for the parameters: <math>\forall k \; \mu_k | \tau_k \sim Normal(\mu_0,(\tau_0 \tau_k)^{-1})</math> and <math>\forall k \; \tau_k \sim Gamma(\alpha,\beta)</math> | ||
** for the latent variables: <math>\forall n \; z_n \sim Multinomial_K(1,\mathbf{w})</math> and <math>\mathbf{w} \sim Dirichlet(\gamma)</math> | ** for the latent variables: <math>\forall n \; z_n \sim Multinomial_K(1,\mathbf{w})</math> and <math>\mathbf{w} \sim Dirichlet(\gamma)</math> | ||
+ | |||
+ | * '''Variational Bayes''': our primary goal here is to calculate the marginal log-likelihood of our data set: | ||
+ | |||
+ | <math>\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; p(\mathbf{y}, \mathbf{z}, \Theta | K)</math> | ||
+ | |||
+ | However, the presence of latent variables induces dependencies between all the parameters of the model. | ||
+ | This makes the marginal likelihood difficult to compute exactly. | ||
+ | An elegant solution is to introduce a "variational distribution" <math>q</math> over the parameters and the latent variables: | ||
+ | |||
+ | <math>\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \left( \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} \right) + C_{\mathbf{z}, \Theta}</math> | ||
+ | |||
+ | The constant <math>C</math> is here to remind us that <math>q</math> is constrained to be a distribution, i.e. to sum (or integrate) to 1, which can be enforced with a Lagrange multiplier. | ||
+ | |||
+ | The '''crucial assumption''' is that the parameters and the latent variables are independent under <math>q</math>: | ||
+ | |||
+ | <math>q(\mathbf{z}, \Theta) = q(\mathbf{z}) q(\Theta)</math> | ||
+ | |||
+ | We can then use the concavity of the logarithm and Jensen's inequality to obtain a lower bound on the marginal log-likelihood: | ||
+ | |||
+ | <math>\mathrm{ln} \, p(\mathbf{y} | K) \ge \int_\Theta \, \mathrm{d}\Theta \; q(\Theta) \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; q(\mathbf{z}) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z} | \Theta, K)}{q(\mathbf{z})} + \mathrm{ln} \, \frac{p(\Theta | K)}{q(\Theta)} \right) + C_{\mathbf{z}} + C_{\Theta}</math> | ||
+ | |||
+ | Now we have to optimize the right-hand side of the inequality. Let's name it <math>\mathcal{F}</math>, as it is a [http://en.wikipedia.org/wiki/Functional_%28mathematics%29 functional], i.e. a ''function of functions''. Using the [http://en.wikipedia.org/wiki/Calculus_of_variations calculus of variations], we'll find the function <math>q</math> that maximizes it. | ||
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> |
Revision as of 15:29, 31 August 2012
Variational Bayes approach for the mixture of Normals
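Variational Bayes: our primary goal here is to calculate the marginal log-likelihood of our data set:

<math>\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; p(\mathbf{y}, \mathbf{z}, \Theta | K)</math>

However, the presence of latent variables induces dependencies between all the parameters of the model. This makes the marginal likelihood difficult to compute exactly. An elegant solution is to introduce a "variational distribution" <math>q</math> over the parameters and the latent variables:

<math>\mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \left( \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} \right) + C_{\mathbf{z}, \Theta}</math>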
The constant <math>C</math> is here to remind us that <math>q</math> is constrained to be a distribution, i.e. to sum (or integrate) to 1, which can be enforced with a Lagrange multiplier.

The crucial assumption is that the parameters and the latent variables are independent under <math>q</math>:

<math>q(\mathbf{z}, \Theta) = q(\mathbf{z}) q(\Theta)</math>
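We can then use the concavity of the logarithm and Jensen's inequality to obtain a lower bound on the marginal log-likelihood:

<math>\mathrm{ln} \, p(\mathbf{y} | K) \ge \int_\Theta \, \mathrm{d}\Theta \; q(\Theta) \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; q(\mathbf{z}) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z} | \Theta, K)}{q(\mathbf{z})} + \mathrm{ln} \, \frac{p(\Theta | K)}{q(\Theta)} \right) + C_{\mathbf{z}} + C_{\Theta}</math>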
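The Jensen bound can be checked numerically on a toy problem that is small enough to enumerate. The sketch below is only an illustration under loudly stated assumptions: the data, the weights, the shared precision, the three-point grid standing in for the parameter space and the particular factorized q are all made up for the example, and the means are the only unknown parameters.

```python
import itertools
import math

def normal_pdf(x, mean, precision):
    """Density of Normal(mean, precision^-1) evaluated at x."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(-0.5 * precision * (x - mean) ** 2)

# Toy setting: every value below is an illustrative assumption.
y = [-1.2, 0.3, 2.1]                                  # N = 3 observations
w = [0.5, 0.5]                                        # fixed mixture weights, K = 2
tau = 1.0                                             # fixed, shared precision
theta_grid = [(-1.0, 1.0), (0.0, 2.0), (-2.0, 2.0)]   # candidate (mu_1, mu_2) pairs
prior_theta = [1.0 / 3, 1.0 / 3, 1.0 / 3]             # discrete prior p(Theta | K)
z_configs = list(itertools.product(range(2), repeat=len(y)))  # all 2^N allocations

def complete_lik(z, mu):
    """p(y, z | Theta, K) = prod_n w_{z_n} Normal(y_n; mu_{z_n}, tau^-1)."""
    p = 1.0
    for y_n, z_n in zip(y, z):
        p *= w[z_n] * normal_pdf(y_n, mu[z_n], tau)
    return p

def log_marginal():
    """Exact ln p(y | K), summing over every z and every grid point of Theta."""
    total = 0.0
    for j, mu in enumerate(theta_grid):
        for z in z_configs:
            total += prior_theta[j] * complete_lik(z, mu)
    return math.log(total)

def lower_bound(q_z, q_theta):
    """Jensen lower bound for a factorized q(z) q(Theta) with strictly positive entries."""
    bound = 0.0
    for j, mu in enumerate(theta_grid):
        inner = sum(q_z[i] * math.log(complete_lik(z, mu) / q_z[i])
                    for i, z in enumerate(z_configs))
        bound += q_theta[j] * (inner + math.log(prior_theta[j] / q_theta[j]))
    return bound

# An arbitrary factorized variational distribution (not optimized).
q_z = [1.0 / len(z_configs)] * len(z_configs)
q_theta = [0.2, 0.5, 0.3]

log_py = log_marginal()
F = lower_bound(q_z, q_theta)
print("ln p(y | K) =", log_py)
print("lower bound =", F)
```

Any other strictly positive choice of `q_z` and `q_theta` that sums to 1 should also give a value below `log_py`; the gap is what the optimization of q tries to close.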
Now we have to optimize the right-hand side of the inequality. Let's name it <math>\mathcal{F}</math>, as it is a functional, i.e. a function of functions. Using the calculus of variations, we'll find the function <math>q</math> that maximizes it.
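For orientation, here is the standard form that the stationary points take for a factorized bound of this kind. It is not derived in this entry and is stated as a sketch, assuming the functional is the Jensen lower bound obtained with <math>q(\mathbf{z}, \Theta) = q(\mathbf{z}) q(\Theta)</math>. Setting the functional derivative with respect to each factor to zero, with a Lagrange multiplier for normalization, gives:

<math>q^*(\mathbf{z}) \propto \exp \left( \int_\Theta \, \mathrm{d}\Theta \; q(\Theta) \; \mathrm{ln} \, p(\mathbf{y}, \mathbf{z} | \Theta, K) \right)</math>

<math>q^*(\Theta) \propto p(\Theta | K) \; \exp \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; q(\mathbf{z}) \; \mathrm{ln} \, p(\mathbf{y}, \mathbf{z} | \Theta, K) \right)</math>

Each optimum depends on the other factor, so in practice the two updates are alternated until the bound stops increasing.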