User:Timothee Flutre/Notebook/Postdoc/2011/12/14

From OpenWetWare
< User:Timothee Flutre‎ | Notebook‎ | Postdoc‎ | 2011‎ | 12
Revision as of 09:46, 14 December 2011 by Timothee Flutre (talk | contribs) (Entry title: add motivation, examples, aim, model)
Jump to: navigation, search
Owwnotebook icon.png Project name <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page
<html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html>

Learn about mixture models and the EM algorithm

  • Motivation and examples: it's frequent to collect heterogeneous data, ie. for which we suspect that they come from several clusters. For instance, we measure the height of individuals without recording their gender, we measure the levels of expression of a gene in several individuals without recording which ones are healthy and which ones are sick, etc.
  • Data: we have N observations, noted Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle X = (x_1, x_2, ..., x_N)} . For the moment, we suppose that each observation Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle x_i} is univariate, ie. each corresponds to only one number.
  • Hypotheses and aim: let's assume that the data are heterogeneous and that they can be partitioned into Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle K} clusters (see examples above). This means that a subset of the observations come from cluster Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle k=1} , another subset come from cluster Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle k=2} , and so on.
  • Model: technically, we say that the observations were generated by a family of density functions. The density of all the observations is thus a mixture of densities, one per cluster. In our case, we will assume that each cluster Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle k} corresponds to a Normal distribution of mean Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle \mu_k} and standard deviation Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle \sigma_k} . Moreover, as we don't know for sure from which cluster a given observation comes from, we define the mixture probability Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle w_k} to be the probability that any given observation comes from cluster Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle k} . As a result, we have the following list of parameters: Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle \theta=(w_1,...,w_K,\mu_1,...\mu_K,\sigma_1,...,\sigma_K} . Finally, for a given observation Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle x_i} , we can write the model Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle f(x_i/\theta) = \sum_{k=1}^{K} w_k g(x_i/\mu_k,\sigma_k)} , wth Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle g} being the Normal distribution Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle g(x_i/\mu_k,\sigma_k) = \frac{1}{\sqrt{2\pi} \sigma_k} \exp^{-\frac{1}{2}(\frac{x_i - \mu_k}{\sigma_k})^2}}