User:Timothee Flutre/Notebook/Postdoc/2012/08/16
From OpenWetWare
(→Variational Bayes approach for the mixture of Normals: improve notation) 
Current revision (19:28, 5 August 2013) (view source) (→Variational Bayes approach for the mixture of Normals: finish update q_mu,tau) 

(5 intermediate revisions not shown.)  
Line 26:  Line 26:  
* '''Latent variables''': N hidden variables, <math>z_1,\ldots,z_N</math>, each being a vector of length K with a single 1 indicating the component to which the <math>n^{th}</math> observation belongs, and K1 zeroes.  * '''Latent variables''': N hidden variables, <math>z_1,\ldots,z_N</math>, each being a vector of length K with a single 1 indicating the component to which the <math>n^{th}</math> observation belongs, and K1 zeroes.  
+  
+  <math>p(\mathbf{z}\mathbf{w},K) = \prod_{n=1}^N p(z_n\mathbf{w},K) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}}</math>  
* '''Augmented likelihood''':  * '''Augmented likelihood''':  
Line 41:  Line 43:  
* '''Priors''': conjuguate  * '''Priors''': conjuguate  
+  ** <math>p(\Theta  K) = p(\mathbf{w}  K) \prod_k p(\tau_k) p(\mu_k  \tau_k)</math>  
** <math>\forall k \; \tau_k \sim \mathcal{G}a(\alpha,\beta)</math> and <math>\forall k \; \mu_k  \tau_k \sim \mathcal{N}(\mu_0,(\tau_0 \tau_k)^{1})</math>  ** <math>\forall k \; \tau_k \sim \mathcal{G}a(\alpha,\beta)</math> and <math>\forall k \; \mu_k  \tau_k \sim \mathcal{N}(\mu_0,(\tau_0 \tau_k)^{1})</math>  
  ** <math>\forall n \; z_n \sim \mathcal{M}ult_K(1,\mathbf{w})</math> and <math>\mathbf{w} \sim \mathcal{D}ir(\  +  ** <math>\forall n \; z_n \sim \mathcal{M}ult_K(1,\mathbf{w})</math> and <math>\mathbf{w} \sim \mathcal{D}ir(\gamma_0)</math> 
Line 91:  Line 94:  
Then we set this functional derivative to zero. We also make use of a frequent assumption, namely that the variational distribution fully factorizes over each individual latent variables ([http://en.wikipedia.org/wiki/Mean_field_theory meanfield assumption]):  Then we set this functional derivative to zero. We also make use of a frequent assumption, namely that the variational distribution fully factorizes over each individual latent variables ([http://en.wikipedia.org/wiki/Mean_field_theory meanfield assumption]):  
  <math>\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} \bigg_{q_{\mathbf{z}}^{(t+1)}} = 0 \Longleftrightarrow \forall \, n \; \mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \int_\Theta \; q_\Theta(\Theta) \; \mathrm{ln} \, p(  +  <math>\frac{\partial \mathcal{F}_K}{\partial q_{\mathbf{z}}} \bigg_{q_{\mathbf{z}}^{(t+1)}} = 0 \Longleftrightarrow \forall \, n \; \mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \int_\Theta \; q_\Theta(\Theta) \; \mathrm{ln} \, p(y_n,z_n\Theta,K) \, \mathrm{d}\Theta \; + \; C_{z_n}</math> 
  +  Recognizing the expectation and factorizing <math>q_\Theta(\Theta)</math> into <math>q_\mathbf{w}(\mathbf{w})q_\mathbf{\mu,\tau}(\mathbf{\mu,\tau})</math>, we get:  
+  
+  <math>\mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \mathbb{E}_\mathbf{w}[\mathrm{ln} \, p(z_n\mathbf{w},K)] + \mathbb{E}_{\mathbf{\mu,\tau}}[\mathrm{ln} \, p(y_nz_n,\mathbf{\mu},\mathbf{\tau},K)] \; + \; \text{constant}</math>  
+  
+  <math>\mathrm{ln} \, q_{z_n}^{(t+1)}(z_n) = \sum_{k=1}^K ( z_{nk} \; \mathrm{ln} \, \rho_{nk} ) \; + \; \text{constant}</math> where <math>\mathrm{ln} \, \rho_{nk} = \mathbb{E}[\mathrm{ln} \, w_k] + \frac{1}{2} \mathbb{E}[\mathrm{ln} \, \tau_k]  \frac{1}{2} \mathrm{ln} \, 2\pi  \frac{1}{2} \mathbb{E}[\tau_k (y_n\mu_k)^2]</math>  
+  
+  Taking the exponential: <math>q_{z_n}^{(t+1)}(z_n) \propto \prod_k \rho_{nk}^{z_{nk}}</math>  
+  
+  As this should be a distribution, it should sum to one, and therefore:  
+  
+  <math>q_{z_n}^{(t+1)}(z_n) = \prod_k r_{nk}^{z_{nk}}</math> where <math>r_{nk} = \frac{\rho_{nk}}{\sum_{k'=1}^K \rho_{nk'}}</math> ("r" stands for "reponsability")  
+  
+  Interestingly, even though we haven't specified anything yet about <math>q_{z_n}</math>, we can see that it is of the same form as the prior on <math>z_n</math>, a Multinomial distribution.  
* '''Updates for <math>q_\Theta</math>''':  * '''Updates for <math>q_\Theta</math>''':  
  TODO  +  We start by writing the functional derivative of <math>\mathcal{F}_K</math> with respect to <math>q_{\Theta}</math>: 
+  
+  <math>\frac{\partial \mathcal{F}_K}{\partial q_\Theta} = \frac{\partial}{\partial q_\Theta} \left( \int_\Theta \; q_\Theta(\Theta) \; \left( \int_\mathbf{z} \; q_\mathbf{z}(\mathbf{z}) \; \mathrm{ln} \, p(\mathbf{y}, \mathbf{z}  \Theta, K) \, \mathrm{d}\mathbf{z} + \mathrm{ln} \, \frac{p(\Theta  K)}{q_\Theta(\Theta)} \right) \, \mathrm{d}\Theta \right) \; + \; C_{\Theta}</math>  
+  
+  <math>\frac{\partial \mathcal{F}_K}{\partial q_\Theta} = \int_\mathbf{z} \; q_\mathbf{z}(\mathbf{z}) \; \mathrm{ln} \, p(\mathbf{z}  \Theta, K) \; \mathrm{ln} \, p(\mathbf{y}  \mathbf{z}, \Theta, K) \, \mathrm{d}\mathbf{z} \; + \; \mathrm{ln} \, p(\Theta  K)  \mathrm{ln} \, q_\Theta(\Theta) \;  1 \; + \; C_{\Theta}</math>  
+  
+  Then, when setting this functional derivative to zero, we obtain:  
+  
+  <math>\mathrm{ln} \, q_\Theta(\Theta)^{(t+1)} = \mathbb{E}_\mathbf{z}[\mathrm{ln} \, p(\mathbf{z}  \mathbf{w}, K)] \; + \; \sum_n \sum_k \mathbb{E}[z_{nk}] \mathrm{ln} \, p(y_n  \mu_k, \tau_k) \; + \; \mathrm{ln} \, p(\mathbf{w}  K) \; + \; \sum_k \mathrm{ln} \, p(\mu_k, \tau_k  K) \; + \; \text{constant}</math>  
+  
+  Note how no term involves both <math>\mathbf{w}</math> and <math>\mu_k,\tau_k</math>.  
+  This naturally implies the factorization <math>q_\Theta(\Theta) = q_\mathbf{w}(\mathbf{w})q_\mathbf{\mu,\tau}(\mathbf{\mu,\tau})</math>.  
+  And we can also notice the following factorization <math>q_\mathbf{\mu,\tau}(\mathbf{\mu,\tau}) = \prod_k q(\mu_k, \tau_k)</math>.  
+  
+  Starting with <math>\mathbf{w}</math>:  
+  
+  <math>\mathrm{ln} \, q_\mathbf{w}^{(t+1)}(\mathbf{w}) = \sum_n \sum_k \mathbb{E}[z_{nk}] \, \mathrm{ln} \, w_k \; + \; (\gamma_0  1) \sum_k \mathrm{ln} \, w_k \; + \; \text{constant}</math>  
+  
+  Recognizing <math>\mathbb{E}[z_{nk}] = r_{nk}</math> and taking the exponential, we get another Dirichlet distribution:  
+  
+  <math>q_\mathbf{w}^{(t+1)}(\mathbf{w}) \propto \prod_k w_k^{\gamma_01+\sum_n r_{nk}}</math>  
+  
+  that is <math>q_\mathbf{w}^{(t+1)} \sim \mathcal{D}ir(\gamma^{(t+1)})</math> with <math>\gamma_k^{(t+1)}=\gamma_01+\sum_n r_{nk}</math>  
+  
+  And now, about the other parameters, we recognize a NormalGamma distribution.  
+  
+  <math>q(\mu_k,\tau_k) = q(\tau_k) q(\mu_k  \tau_k) = \mathcal{N}\mathcal{G}a(\tilde{\mu}_k, \tilde{\tau}_k^{1}, \tilde{\alpha}_k, \tilde{\beta}_k)</math>  
+  
+  It's easier to first define three statistics of the data with respect to the responsabilities:  
+  
+  <math>N_k = \sum_n r_{nk}</math>  
+  
+  <math>\bar{y}_k = \frac{1}{N_k} \sum_n r_{nk} y_n</math>  
+  
+  <math>S_k = \frac{1}{N_k} \sum_n r_{nk} (y_n  \bar{y}_k)^2</math>.  
+  
+  This allows us to concisely write (TODO: check the algebra...):  
+  
+  <math>\tilde{\tau}_k = \tau_0 + N_k</math>  
+  
+  <math>\tilde{\mu}_k = \frac{1}{\tilde{\tau}_k} (\tau_0 \mu_0 + N_k \bar{y}_k)</math>  
+  
+  <math>\tilde{\alpha}_k^{1} = \alpha^{1} + N_k S_k + \frac{\tau_0 N_k}{\tau_0 + N_k}(\bar{y}_k  \mu_0)^2</math>  
+  
+  <math>\tilde{\beta}_k = \beta + N_k + 1</math>  
Line 104:  Line 163:  
TODO  TODO  
+  
+  
+  * '''References''':  
+  ** book "Pattern Recognition and Machine Learning" from Christopher Bishop  
<! ##### DO NOT edit below this line unless you know what you are doing. ##### >  <! ##### DO NOT edit below this line unless you know what you are doing. ##### > 
Current revision
Project name  Main project page Previous entry Next entry  
Variational Bayes approach for the mixture of Normals
The latent variables induce dependencies between all the parameters of the model. This makes it difficult to find the parameters that maximize the likelihood. An elegant solution is to introduce a variational distribution of parameters and latent variables, which leads to a reformulation of the classical EM algorithm. But let's show it directly in the Bayesian paradigm.
We can now introduce a distribution :
The constant is here to remind us that has the constraint of being a distribution, ie. of summing to 1, which can be enforced by a Lagrange multiplier. We can then use the concavity of the logarithm (Jensen's inequality) to derive a lower bound of the marginal loglikelihood:
Let's call this lower bound as it is a functional, ie. a function of functions. To gain some intuition about the impact of introducing q, let's expand :
From this, it is clear that (ie. a lowerbound of the marginal loglikelihood) is the conditional loglikelihood minus the KullbackLeibler divergence between the variational distribution q and the joint posterior of latent variables and parameters. As a side note, minimizing D_{KL}(p   q) is used in the expectationpropagation technique. In practice, we have to make the following crucial assumption of independence on in order for the calculations to be analytically tractable:
This means that approximates the joint posterior, and therefore the lowerbound will be tight if and only if this approximation is exact and the KL divergence is zero. As we ultimately aim at inferring the parameters and latent variables that maximize the marginal loglikelihood, we will use the calculus of variations to find the functions and q_{Θ} that maximize the functional .
This naturally leads to a procedure very similar to the EM algorithm where, at the E step, we calculate the expectations of the parameters with respect to the variational distributions and q_{Θ}, and, at the M step, we recompute the variational distributions over the parameters.
We start by writing the functional derivative of with respect to :
Then we set this functional derivative to zero. We also make use of a frequent assumption, namely that the variational distribution fully factorizes over each individual latent variables (meanfield assumption):
Recognizing the expectation and factorizing q_{Θ}(Θ) into , we get:
where Taking the exponential: As this should be a distribution, it should sum to one, and therefore: where ("r" stands for "reponsability") Interestingly, even though we haven't specified anything yet about , we can see that it is of the same form as the prior on z_{n}, a Multinomial distribution.
We start by writing the functional derivative of with respect to q_{Θ}:
Then, when setting this functional derivative to zero, we obtain:
Note how no term involves both and μ_{k},τ_{k}. This naturally implies the factorization . And we can also notice the following factorization . Starting with :
Recognizing and taking the exponential, we get another Dirichlet distribution:
that is with And now, about the other parameters, we recognize a NormalGamma distribution.
It's easier to first define three statistics of the data with respect to the responsabilities:
. This allows us to concisely write (TODO: check the algebra...):
TODO
