== Learn about mixture models and the EM algorithm ==

(Caution, this is my own quick-and-dirty tutorial, see the references at the end for presentations by professional statisticians.)

* '''Model''': we have N observations <math>x_1,...,x_N</math>, each assumed to be drawn from a mixture of K univariate Normal distributions:

<math>f(x_i|\theta) = \sum_{k=1}^{K} w_k \phi(x_i|\mu_k,\sigma_k) = \sum_{k=1}^{K} w_k \frac{1}{\sqrt{2\pi} \sigma_k} \exp \left( -\frac{1}{2} \left( \frac{x_i - \mu_k}{\sigma_k} \right)^2 \right)</math>

The constraints are: <math>\forall k, w_k > 0</math> and <math>\sum_{k=1}^K w_k = 1</math>.

* '''Maximum likelihood''': the likelihood of the observed data is:

<math>L(\theta) = P(X|\theta) = \prod_{i=1}^N f(x_i|\theta)</math>

As usual, it's easier to deal with the log-likelihood:

<math>l(\theta) = \sum_{i=1}^N ln \left( f(x_i|\theta) \right) = \sum_{i=1}^N ln \left( \sum_{k=1}^K w_k \phi(x_i; \theta_k) \right)</math>

Let's take the derivative with respect to one parameter, e.g. <math>\theta_l</math>:

<math>\frac{\partial l}{\partial \theta_l} = \sum_{i=1}^N \frac{1}{\sum_{k=1}^K w_k \phi(x_i; \theta_k)} w_l \frac{\partial \phi(x_i; \theta_l)}{\partial \theta_l}</math>

<math>\frac{\partial l}{\partial \theta_l} = \sum_{i=1}^N \frac{w_l \phi(x_i; \theta_l)}{\sum_{k=1}^K w_k \phi(x_i; \theta_k)} \frac{1}{\phi(x_i; \theta_l)} \frac{\partial \phi(x_i; \theta_l)}{\partial \theta_l}</math>

<math>\frac{\partial l}{\partial \theta_l} = \sum_{i=1}^N \frac{w_l \phi(x_i; \theta_l)}{\sum_{k=1}^K w_k \phi(x_i; \theta_k)} \frac{\partial ln ( \phi(x_i; \theta_l) )}{\partial \theta_l}</math>

This shows that maximizing the likelihood of a mixture model amounts to a weighted likelihood maximization. However, the weights depend on the parameters we want to estimate! That's why we now switch to the missing-data formulation of the mixture model.
* '''Missing data''': we introduce the following N [http://en.wikipedia.org/wiki/Latent_variable latent variables] <math>Z_1,...,Z_i,...,Z_N</math> (also called hidden or allocation variables), one for each observation, such that <math>Z_i=k</math> means that observation <math>x_i</math> belongs to cluster <math>k</math>. In fact, it is much easier to work out the equations when defining each <math>Z_i</math> as a vector of length <math>K</math>, with <math>Z_{ik}=1</math> if observation <math>x_i</math> belongs to cluster <math>k</math>, and <math>Z_{ik}=0</math> otherwise ([http://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29 indicator variables]). Thanks to this, we can reinterpret the mixture weights: <math>\forall i, P(Z_i=k|\theta)=w_k</math>. Moreover, we can now define the membership probabilities, one for each observation:

<math>p(k|i) = P(Z_{ik}=1|x_i,\theta) = \frac{w_k \phi(x_i|\mu_k,\sigma_k)}{\sum_{l=1}^K w_l \phi(x_i|\mu_l,\sigma_l)}</math>

The observed-data likelihood (also sometimes called "incomplete" or "marginal", even though these appellations are misnomers) is still written the same way:
<math>L_{obs}(\theta) = P(X|\theta) = \prod_{i=1}^N f(x_i|\theta)</math>
But now we can also write the augmented-data likelihood, assuming all observations are independent conditionally on their membership:

<math>L_{aug}(\theta) = P(X,Z|\theta) = \prod_{i=1}^N P(x_i|Z_i,\theta) P(Z_i|\theta) = \prod_{i=1}^N \left( \prod_{k=1}^K \phi(x_i|\mu_k,\sigma_k)^{Z_{ik}} w_k^{Z_{ik}} \right)</math>

And here is the augmented-data log-likelihood (useful in the M step of the EM algorithm, see below):

<math>l_{aug}(\theta) = \sum_{i=1}^N \left( \sum_{k=1}^K Z_{ik} ln(\phi(x_i|\mu_k,\sigma_k)) + \sum_{k=1}^K Z_{ik} ln(w_k) \right)</math>

In terms of [http://en.wikipedia.org/wiki/Graphical_model graphical models], the Gaussian mixture model described here can be represented like [http://en.wikipedia.org/wiki/File:Nonbayesian-gaussian-mixture.svg this].

* '''EM algorithm - definition''': the idea is to iterate two steps, starting from randomly-initialized parameters. First, in the E step, one computes the conditional expectation of the augmented-data log-likelihood over the latent variables, given the observed data and the parameter estimates from the previous iteration. Second, in the M step, one maximizes this expected augmented-data log-likelihood to determine the next iterate of the parameter estimates.
** E step: <math>Q(\theta|X,\theta^{(t)}) = \mathbb{E}_{Z|X,\theta^{(t)}} \left[ ln(P(X,Z|\theta))|X,\theta^{(t)} \right] = \int l_{aug} \, q(Z|X,\theta^{(t)}) dZ</math>
** M step: <math>\theta^{(t+1)} = argmax_{\theta} \, Q(\theta|X,\theta^{(t)})</math> so that <math>\forall \theta \in \Theta, Q(\theta^{(t+1)}|X,\theta^{(t)}) \ge Q(\theta|X,\theta^{(t)})</math>
* '''EM algorithm - theory''': stated like this, the algorithm is not necessarily easy to understand immediately, at least in my case. Fortunately, Matthew Beal presents it in a clear and simple way in his PhD thesis (see references at the bottom of the page).
Here is the observed-data log-likelihood:

<math>l_{obs}(\theta) = \sum_{i=1}^N ln \left( f(x_i|\theta) \right)</math>

First we introduce the hidden variables by integrating them out:

<math>l_{obs}(\theta) = \sum_{i=1}^N ln \left( \int p(x_i,z_i|\theta) dz_i \right)</math>

Then, we use any probability distribution <math>q</math> on these hidden variables (in fact, we use a distinct distribution <math>q_{z_i}</math> for each observation):

<math>l_{obs}(\theta) = \sum_{i=1}^N ln \left( \int q_{z_i}(z_i) \frac{p(x_i,z_i|\theta)}{q_{z_i}(z_i)} dz_i \right)</math>

And here is the great trick, as explained by Beal: "any probability distribution over the hidden variables gives rise to a lower bound on <math>l_{obs}</math>". This is due to the [http://en.wikipedia.org/wiki/Jensen%27s_inequality Jensen inequality] (the logarithm is concave):

<math>l_{obs}(\theta) \ge \sum_{i=1}^N \int q_{z_i}(z_i) ln \left( \frac{p(x_i,z_i|\theta)}{q_{z_i}(z_i)} \right) dz_i = \mathcal{F}(q_{z_1}(z_1), ..., q_{z_N}(z_N), \theta)</math>

At each iteration, the E step maximizes the lower bound (<math>\mathcal{F}</math>) with respect to the <math>q_{z_i}(z_i)</math>, and the M step maximizes it with respect to <math>\theta</math>:
* E step: <math>q^{(t+1)}_{z_i} \leftarrow argmax_{q_{z_i}} \, \mathcal{F}(q_z(z), \theta^{(t)}) \; \forall i</math>
* M step: <math>\theta^{(t+1)} \leftarrow argmax_\theta \, \mathcal{F}(q^{(t+1)}_z(z), \theta)</math>
The E step amounts to inferring the posterior distribution of the hidden variables <math>q^{(t+1)}_{z_i}</math> given the current parameters <math>\theta^{(t)}</math>:

<math>q^{(t+1)}_{z_i}(z_i) = p(z_i | x_i, \theta^{(t)})</math>

Indeed, the <math>q^{(t+1)}_{z_i}(z_i)</math> make the bound tight (the inequality becomes an equality):

<math>\mathcal{F}(q^{(t+1)}_z(z), \theta^{(t)}) = \sum_{i=1}^N \int q^{(t+1)}_{z_i}(z_i) ln \left( \frac{p(x_i,z_i|\theta^{(t)})}{q^{(t+1)}_{z_i}(z_i)} \right) dz_i</math>

<math>\mathcal{F}(q^{(t+1)}_z(z), \theta^{(t)}) = \sum_{i=1}^N \int p(z_i | x_i, \theta^{(t)}) ln \left( \frac{p(x_i,z_i|\theta^{(t)})}{p(z_i | x_i, \theta^{(t)})} \right) dz_i</math>

<math>\mathcal{F}(q^{(t+1)}_z(z), \theta^{(t)}) = \sum_{i=1}^N \int p(z_i | x_i, \theta^{(t)}) ln \left( \frac{p(x_i|\theta^{(t)}) p(z_i|x_i,\theta^{(t)})}{p(z_i | x_i, \theta^{(t)})} \right) dz_i</math>

<math>\mathcal{F}(q^{(t+1)}_z(z), \theta^{(t)}) = \sum_{i=1}^N \int p(z_i | x_i, \theta^{(t)}) ln \left( p(x_i|\theta^{(t)}) \right) dz_i</math>

<math>\mathcal{F}(q^{(t+1)}_z(z), \theta^{(t)}) = \sum_{i=1}^N ln \left( p(x_i|\theta^{(t)}) \right)</math>

<math>\mathcal{F}(q^{(t+1)}_z(z), \theta^{(t)}) = l_{obs}(\theta^{(t)})</math>

Then, at the M step, we use these statistics to maximize the new lower bound <math>\mathcal{F}</math> with respect to <math>\theta</math>, and thereby find <math>\theta^{(t+1)}</math>.
* '''EM algorithm - variational''': if the posterior distributions <math>p(z_i|x_i,\theta)</math> are intractable, we can use a variational approach to constrain them to be of a particular, tractable form. In the E step, maximizing <math>\mathcal{F}</math> with respect to <math>q_{z_i}</math> is equivalent to minimizing the [http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence Kullback-Leibler divergence] between the variational distribution <math>q_{z_i}(z_i)</math> and the exact hidden-variable posterior <math>p(z_i|x_i,\theta)</math>:

<math>KL[q_{z_i}(z_i) || p(z_i|x_i,\theta)] = \int q_{z_i}(z_i) ln \left( \frac{q_{z_i}(z_i)}{p(z_i|x_i,\theta)} \right) dz_i</math>

As a result, the E step may not always lead to a tight bound.
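To make Beal's argument concrete, here is a minimal, self-contained R sketch (not part of the original code; all parameter values are hypothetical) checking two things on a toy two-component mixture: the bound <math>\mathcal{F}</math> is tight when <math>q</math> is the exact posterior, and for any other <math>q</math> it is loose by exactly the summed Kullback-Leibler divergence, i.e. <math>l_{obs}(\theta) = \mathcal{F}(q,\theta) + \sum_i KL[q_{z_i} || p(z_i|x_i,\theta)]</math>:

<nowiki>
## toy numerical check of the lower bound F (hypothetical parameters and data)
set.seed(1)
w <- c(0.4, 0.6); mus <- c(0, 4); sigmas <- c(1, 1)
x <- c(rnorm(50, mus[1], sigmas[1]), rnorm(50, mus[2], sigmas[2]))
joint <- sapply(1:2, function(k){ w[k] * dnorm(x, mus[k], sigmas[k]) }) # p(x_i,z_i=k|theta), N x K
l.obs <- sum(log(rowSums(joint)))              # observed-data log-likelihood
post <- joint / rowSums(joint)                 # exact posteriors p(z_i=k | x_i, theta)

## 1) with q = exact posterior, the bound is tight
F.tight <- sum(post * (log(joint) - log(post)))
all.equal(F.tight, l.obs)                      # TRUE

## 2) with any other q (here q_ik = w_k, ignoring the data), l.obs = F + KL
q <- matrix(rep(w, each=length(x)), ncol=2)
F.loose <- sum(q * (log(joint) - log(q)))      # lower bound with this q
KL <- sum(q * (log(q) - log(post)))            # sum over i of KL[q_i || p(z_i|x_i,theta)]
all.equal(l.obs, F.loose + KL)                 # TRUE, and F.loose <= l.obs
</nowiki>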
* '''Formulas of both steps''': in both steps we need to use <math>Q</math>, whether to evaluate it or maximize it:

<math>Q(\theta|X,\theta^{(t)}) = \mathbb{E}_{Z|X,\theta^{(t)}} \left[ ln(P(X,Z|\theta))|X,\theta^{(t)} \right]</math>

<math>Q(\theta|X,\theta^{(t)}) = \mathbb{E}_{Z|X,\theta^{(t)}} \left[ l_{aug}(\theta)|X,\theta^{(t)} \right]</math>

<math>Q(\theta|X,\theta^{(t)}) = \sum_{i=1}^N \left( \sum_{k=1}^K \mathbb{E}_{Z|X,\theta^{(t)}}[Z_{ik}|x_i,\theta_k^{(t)}] \, ln(\phi(x_i|\mu_k,\sigma_k)) + \sum_{k=1}^K \mathbb{E}_{Z|X,\theta^{(t)}}[Z_{ik}|x_i,\theta_k^{(t)}] \, ln(w_k) \right)</math>

* '''Formulas of the E step''': as indicated above, the E step consists in evaluating <math>Q</math>, that is, in computing the conditional expectation, over the latent variables, of the augmented-data log-likelihood given the observed data and the current estimates of the parameters. These expectations are simply the membership probabilities:

<math>\mathbb{E}_{Z|X,\theta^{(t)}}[Z_{ik}|x_i,\theta_k^{(t)}] = P(Z_{ik}=1|x_i,\theta_k^{(t)}) = \frac{w_k^{(t)} \phi(x_i|\mu_k^{(t)},\sigma_k^{(t)})}{\sum_{l=1}^K w_l^{(t)} \phi(x_i|\mu_l^{(t)},\sigma_l^{(t)})} = p(k|i)</math>
* '''Formulas of the M step''': in this step, we need to maximize <math>Q</math> (which equals the lower bound <math>\mathcal{F}</math> above up to a term not depending on <math>\theta</math>) with respect to each <math>\theta_k</math>. Only a few standard differentiation rules, of high-school level, are required to write down the analytical formulas of the MLEs (see [http://en.wikipedia.org/wiki/Differentiation_%28mathematics%29#Rules_for_finding_the_derivative here]).

* '''M step - weights''': let's start by finding the maximum-likelihood estimates of the weights <math>w_k</math>. But remember the constraint <math>\sum_{k=1}^K w_k = 1</math>. To enforce it, we can use a [http://en.wikipedia.org/wiki/Lagrange_multiplier Lagrange multiplier], <math>\lambda</math>. This means that we now need to maximize the following Lagrange function, in which only the part of <math>Q</math> that depends on the weights is kept:

<math>\Lambda(w_k,\lambda) = \sum_{i=1}^N \left( \sum_{k=1}^K p(k|i) ln(w_k) \right) + \lambda (1 - \sum_{k=1}^K w_k)</math>

As usual, to find the maximum, we take the derivative and set it to zero:

<math>\frac{\partial \Lambda}{\partial w_k} = \sum_{i=1}^N \left( p(k|i) \frac{1}{\hat{w}_k^{(t+1)}} \right) - \lambda = 0</math>

<math>\hat{w}_k^{(t+1)} = \frac{1}{\lambda} \sum_{i=1}^N p(k|i)</math>

Now, to find the multiplier, we go back to the constraint:

<math>\sum_{k=1}^K \hat{w}_k^{(t+1)} = 1 \rightarrow \lambda = \sum_{i=1}^N \sum_{k=1}^K p(k|i) = N</math>

Finally:

<math>\hat{w}_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^N p(k|i)</math>
* '''M step - means''':

<math>\frac{\partial Q}{\partial \mu_k} = \sum_{i=1}^N p(k|i) \frac{\partial ln(\phi(x_i|\mu_k,\sigma_k))}{\partial \mu_k}</math>

<math>\frac{\partial Q}{\partial \mu_k} = \sum_{i=1}^N p(k|i) \frac{1}{\phi(x_i|\mu_k,\sigma_k)} \frac{\partial \phi(x_i|\mu_k,\sigma_k)}{\partial \mu_k} = \sum_{i=1}^N p(k|i) \frac{x_i - \mu_k}{\sigma_k^2}</math>

Setting this derivative to zero gives:

<math>\sum_{i=1}^N p(k|i) (x_i - \hat{\mu}_k^{(t+1)}) = 0</math>

Finally:

<math>\hat{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^N p(k|i) x_i}{\sum_{i=1}^N p(k|i)}</math>
* '''M step - variances''': we use the same kind of algebra, this time differentiating with respect to <math>\sigma_k</math>:

<math>\frac{\partial Q}{\partial \sigma_k} = \sum_{i=1}^N p(k|i) \left( \frac{-1}{\sigma_k} + \frac{(x_i - \mu_k)^2}{\sigma_k^3} \right)</math>

Setting this derivative to zero gives:

<math>\hat{\sigma}_k^{(t+1)} = \sqrt{\frac{\sum_{i=1}^N p(k|i) (x_i - \hat{\mu}_k^{(t+1)})^2}{\sum_{i=1}^N p(k|i)}}</math>
* '''M step - weights (2)''': instead of using a Lagrange multiplier, we can also write the weights in terms of unconstrained variables <math>\gamma_k</math> ([http://en.wikipedia.org/wiki/Softmax_activation_function softmax function]):

<math>w_k = \frac{e^{\gamma_k}}{\sum_{l=1}^K e^{\gamma_l}}</math>

<math>\frac{\partial w_k}{\partial \gamma_j} = \begin{cases} w_k - w_k^2 & \mbox{if } j = k \\ -w_k w_j & \mbox{otherwise} \end{cases}</math>

Using the chain rule, the derivative of the log-likelihood with respect to <math>\gamma_k</math> simplifies nicely:

<math>\frac{\partial l(\theta)}{\partial \gamma_k} = \sum_{i=1}^N (p(k|i) - w_k)</math>

Finally, setting this derivative to zero (with the memberships held fixed, as in the M step) recovers the same estimate as with the Lagrange multiplier:

<math>\hat{w}_k = \frac{1}{N} \sum_{i=1}^N p(k|i)</math>
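As an aside, here is a minimal R sketch (not part of the original code; the function names and step size are hypothetical) of this unconstrained parameterization: it maximizes the weight-dependent part of <math>Q</math> by gradient ascent on the <math>\gamma_k</math>, using the gradient above, and should converge to the closed-form estimates <math>\hat{w}_k = \frac{1}{N} \sum_{i=1}^N p(k|i)</math>. It assumes an NxK matrix of membership probabilities, e.g. as returned by the E step below.

<nowiki>
## hypothetical illustration: estimate the weights by gradient ascent on the
## unconstrained gammas (softmax), given fixed membership probabilities p.ki (N x K)
SoftmaxWeights <- function(gammas){
  exp(gammas) / sum(exp(gammas))
}
GradientAscentMixWeights <- function(p.ki, nb.iters=5000, step.size=0.5){
  gammas <- rep(0, ncol(p.ki))          # start from equal weights
  for(t in 1:nb.iters){
    w <- SoftmaxWeights(gammas)
    grad <- colMeans(p.ki) - w          # (1/N) sum_i (p(k|i) - w_k)
    gammas <- gammas + step.size * grad
  }
  return(SoftmaxWeights(gammas))
}
## the result should be close to the closed-form estimates: colMeans(p.ki)
</nowiki>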
* '''R code to simulate data''': if you have read this far, nothing beats implementing the EM algorithm yourself!
<nowiki>
#' Generate univariate observations from a mixture of Normals
#'
#' @param K number of components
#' @param N number of observations
#' @param gap difference between all component means
GetUnivariateSimulatedData <- function(K=2, N=100, gap=6){
  mus <- seq(0, gap*(K-1), gap)
  sigmas <- runif(n=K, min=0.5, max=1.5)
  tmp <- floor(rnorm(n=K-1, mean=floor(N/K), sd=5))
  ns <- c(tmp, N - sum(tmp))
  clusters <- as.factor(matrix(unlist(lapply(1:K, function(k){rep(k, ns[k])})),
                               ncol=1))
  obs <- matrix(unlist(lapply(1:K, function(k){
    rnorm(n=ns[k], mean=mus[k], sd=sigmas[k])
  })))
  new.order <- sample(1:N, N)
  obs <- obs[new.order]
  rownames(obs) <- NULL
  clusters <- clusters[new.order]
  return(list(obs=obs, clusters=clusters, mus=mus, sigmas=sigmas,
              mix.weights=ns/N))
}
</nowiki>
* '''R code for the E step''':
<nowiki>
#' Return probas of latent variables given data and parameters from previous iteration
#'
#' @param data Nx1 vector of observations
#' @param params list whose components are mus, sigmas and mix.weights
Estep <- function(data, params){
  GetMembershipProbas(data, params$mus, params$sigmas, params$mix.weights)
}

#' Return the membership probabilities P(zi=k|xi,theta)
#'
#' @param data Nx1 vector of observations
#' @param mus Kx1 vector of means
#' @param sigmas Kx1 vector of std deviations
#' @param mix.weights Kx1 vector of mixture weights w_k=P(zi=k|theta)
#' @return NxK matrix of membership probas
GetMembershipProbas <- function(data, mus, sigmas, mix.weights){
  N <- length(data)
  K <- length(mus)
  tmp <- matrix(unlist(lapply(1:N, function(i){
    x <- data[i]
    norm.const <- sum(unlist(Map(function(mu, sigma, mix.weight){
      mix.weight * GetUnivariateNormalDensity(x, mu, sigma)}, mus, sigmas, mix.weights)))
    unlist(Map(function(mu, sigma, mix.weight){
      mix.weight * GetUnivariateNormalDensity(x, mu, sigma) / norm.const
    }, mus[-K], sigmas[-K], mix.weights[-K]))
  })), ncol=K-1, byrow=TRUE)
  membership.probas <- cbind(tmp, apply(tmp, 1, function(x){1 - sum(x)}))
  names(membership.probas) <- NULL
  return(membership.probas)
}

#' Univariate Normal density
GetUnivariateNormalDensity <- function(x, mu, sigma){
  return( 1/(sigma * sqrt(2*pi)) * exp(-1/(2*sigma^2)*(x-mu)^2) )
}
</nowiki>
* '''R code for the M step''':
<nowiki>
#' Return ML estimates of parameters
#'
#' @param data Nx1 vector of observations
#' @param params list whose components are mus, sigmas and mix.weights
#' @param membership.probas NxK matrix with entry i,k being P(zi=k|xi,theta)
Mstep <- function(data, params, membership.probas){
  params.new <- list()
  sum.membership.probas <- apply(membership.probas, 2, sum)
  params.new$mus <- GetMlEstimMeans(data, membership.probas,
                                    sum.membership.probas)
  params.new$sigmas <- GetMlEstimStdDevs(data, params.new$mus,
                                         membership.probas,
                                         sum.membership.probas)
  params.new$mix.weights <- GetMlEstimMixWeights(data, membership.probas,
                                                 sum.membership.probas)
  return(params.new)
}

#' Return ML estimates of the means (1 per cluster)
#'
#' @param data Nx1 vector of observations
#' @param membership.probas NxK matrix with entry i,k being P(zi=k|xi,theta)
#' @param sum.membership.probas Kx1 vector of sums per column of the matrix above
#' @return Kx1 vector of means
GetMlEstimMeans <- function(data, membership.probas, sum.membership.probas){
  K <- ncol(membership.probas)
  sapply(1:K, function(k){
    sum(unlist(Map("*", membership.probas[,k], data))) /
      sum.membership.probas[k]
  })
}

#' Return ML estimates of the std deviations (1 per cluster)
#'
#' @param data Nx1 vector of observations
#' @param means Kx1 vector of means (the new estimates from GetMlEstimMeans)
#' @param membership.probas NxK matrix with entry i,k being P(zi=k|xi,theta)
#' @param sum.membership.probas Kx1 vector of sums per column of the matrix above
#' @return Kx1 vector of std deviations
GetMlEstimStdDevs <- function(data, means, membership.probas,
                              sum.membership.probas){
  K <- ncol(membership.probas)
  sapply(1:K, function(k){
    sqrt(sum(unlist(Map(function(p.ki, x.i){
      p.ki * (x.i - means[k])^2
    }, membership.probas[,k], data))) /
      sum.membership.probas[k])
  })
}

#' Return ML estimates of the mixture weights
#'
#' @param data Nx1 vector of observations
#' @param membership.probas NxK matrix with entry i,k being P(zi=k|xi,theta)
#' @param sum.membership.probas Kx1 vector of sums per column of the matrix above
#' @return Kx1 vector of mixture weights
GetMlEstimMixWeights <- function(data, membership.probas,
                                 sum.membership.probas){
  K <- ncol(membership.probas)
  sapply(1:K, function(k){
    1/length(data) * sum.membership.probas[k]
  })
}
</nowiki>
* '''R code for the log-likelihood''':
<nowiki>
#' Return the observed-data log-likelihood of the univariate Normal mixture
GetLogLikelihood <- function(data, mus, sigmas, mix.weights){
  loglik <- sum(sapply(data, function(x){
    log(sum(unlist(Map(function(mu, sigma, mix.weight){
      mix.weight * GetUnivariateNormalDensity(x, mu, sigma)
    }, mus, sigmas, mix.weights))))
  }))
  return(loglik)
}
</nowiki>
* '''R code for the EM loop''':
<nowiki>
#' Run the EM algorithm on a univariate Normal mixture
#'
#' @param data Nx1 vector of observations
#' @param params list of initial values for mus, sigmas and mix.weights
#' @param threshold.convergence stop when the log-likelihood increases by less than this
#' @param nb.iter maximum number of iterations
EMalgo <- function(data, params, threshold.convergence=10^(-2), nb.iter=10,
                   verbose=1){
  logliks <- vector()
  i <- 1
  if(verbose > 0) cat(paste("iter ", i, "\n", sep=""))
  membership.probas <- Estep(data, params)
  params <- Mstep(data, params, membership.probas)
  loglik <- GetLogLikelihood(data, params$mus, params$sigmas,
                             params$mix.weights)
  logliks <- append(logliks, loglik)
  while(i < nb.iter){
    i <- i + 1
    if(verbose > 0) cat(paste("iter ", i, "\n", sep=""))
    membership.probas <- Estep(data, params)
    params <- Mstep(data, params, membership.probas)
    loglik <- GetLogLikelihood(data, params$mus, params$sigmas, params$mix.weights)
    if(loglik < logliks[length(logliks)]){
      msg <- paste("the log-likelihood is decreasing:", loglik, "<", logliks[length(logliks)])
      stop(msg, call.=FALSE)
    }
    logliks <- append(logliks, loglik)
    if(abs(logliks[i] - logliks[i-1]) <= threshold.convergence)
      break
  }
  return(list(params=params, membership.probas=membership.probas, logliks=logliks, nb.iters=i))
}
</nowiki>
* '''Example''': and now, let's try it!
<nowiki>
## simulate data
K <- 3
N <- 300
simul <- GetUnivariateSimulatedData(K, N)
data <- simul$obs

## run the EM algorithm
params0 <- list(mus=runif(n=K, min=min(data), max=max(data)),
                sigmas=rep(1, K),
                mix.weights=rep(1/K, K))
res <- EMalgo(data, params0, 10^(-3), 1000, 1)

## check its convergence
plot(res$logliks, xlab="iterations", ylab="log-likelihood",
     main="Convergence of the EM algorithm", type="b")

## plot the data along with the inferred densities
png("mixture_univar_em.png")
hist(data, breaks=30, freq=FALSE, col="grey", border="white", ylim=c(0,0.15),
     main="Histogram of data overlaid with densities inferred by EM")
rx <- seq(from=min(data), to=max(data), by=0.1)
ds <- lapply(1:K, function(k){dnorm(x=rx, mean=res$params$mus[k], sd=res$params$sigmas[k])})
f <- sapply(1:length(rx), function(i){
  res$params$mix.weights[1] * ds[[1]][i] + res$params$mix.weights[2] * ds[[2]][i] + res$params$mix.weights[3] * ds[[3]][i]
})
lines(rx, f, col="red", lwd=2)
dev.off()
</nowiki>
It seems to work well, which was expected as the clusters are well separated from each other...
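As a quick extra check (not in the original commands above), one can re-run the same pipeline with a smaller gap between the component means; with overlapping clusters, the EM algorithm typically needs more iterations and the estimates tend to be less accurate:

<nowiki>
## same pipeline with overlapping clusters (smaller gap between means)
simul2 <- GetUnivariateSimulatedData(K=3, N=300, gap=2)
params0 <- list(mus=runif(n=3, min=min(simul2$obs), max=max(simul2$obs)),
                sigmas=rep(1, 3),
                mix.weights=rep(1/3, 3))
res2 <- EMalgo(simul2$obs, params0, 10^(-3), 1000, 0)
res2$nb.iters                                    # usually larger than with the default gap of 6
rbind(truth=simul2$mus, estimate=sort(res2$params$mus))
</nowiki>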
The classification of each observation can be obtained via the following command:
<nowiki>
## get the classification of the observations
memberships <- apply(res$membership.probas, 1, function(x){which(x > 0.5)})
table(memberships)
</nowiki>
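Since GetUnivariateSimulatedData also returns the true cluster labels, it is natural to cross-tabulate the inferred classification against them (a small addition, not in the original commands; note that the component labels are only identifiable up to a permutation, so a good result is a table with a single large count per row):

<nowiki>
## compare the inferred classification with the true cluster labels
## (labels may be permuted: look for one dominant count per row)
table(memberships, simul$clusters)
</nowiki>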
* '''References''':
** chapter 1 from the PhD thesis of Matthew Stephens (Oxford, 2000), freely available [http://www.stat.washington.edu/stephens/papers/tabstract.html online]
** chapter 2 from the PhD thesis of Matthew Beal (UCL, 2003), freely available [http://www.cse.buffalo.edu/faculty/mbeal/thesis/ online]
** lecture "Mixture Models, Latent Variables and the EM Algorithm" from Cosma Shalizi, freely available [http://www.stat.cmu.edu/~cshalizi/uADA/12/ online]
** book "Introducing Monte Carlo Methods with R" from Robert and Casella (2009)