# User:Timothee Flutre/Notebook/Postdoc/2011/11/20


## Entry title

• Prepare journal club on "Analysis of population structure: A unifying framework and novel methods based on sparse factor analysis." by Engelhardt & Stephens (PLoS Genetics 2010).
• From "Inference of population structure using multilocus genotype data" by Pritchard, Stephens & Donnelly (Genetics 2000):
• data: genotypes at L loci for N individuals (matrix X: N x L) from several populations (K, unknown)
• aim: jointly assign individuals to populations while estimating population allele frequencies P, allow admixture, use MCMC
• From "Applied Multivariate Statistical Analysis":
• Let $\mathbf{X}$ be a vector of p observed variables with $\mathbf{\mu}$ as mean vector and $\mathbf{\Sigma}$ as covariance matrix.
• A principal component analysis is concerned with explaining the variance-covariance structure of $\mathbf{X}$ through a few linear (and uncorrelated) combinations of these variables. Although p components are required to reproduce the total variability, often much of this variability can be accounted for by a small number k of the principal components that depend solely on $\mathbf{\Sigma}$.
• A factor analysis attempts to describe the covariance relationships among the X's in terms of a few underlying, but unobservable, random quantities called factors. It postulates that $\mathbf{X}$ is linearly dependent upon k random variables $F_1, F_2, \ldots, F_k$ called factors, and p additional sources of variation $\epsilon_1, \epsilon_2, \ldots, \epsilon_p$ called errors. A matrix $\mathbf{\Lambda}$ (p x k) contains the loadings $l_{ij}$ of the ith variable on the jth factor: $\mathbf{X} = \mathbf{\mu} + \mathbf{\Lambda} \mathbf{F} + \mathbf{\epsilon}$
• The difference between the factor analysis model above and the multivariate linear regression model, $\mathbf{Y} = \mathbf{X} \mathbf{B} + \mathbf{\epsilon}$, is that in the latter both $\mathbf{Y}$ and $\mathbf{X}$ are observed, whereas in the former $\mathbf{F}$ is not.
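To make the factor model in the notes above concrete, here is a minimal simulation sketch. It assumes the usual orthogonal factor model, which the notes do not spell out: the factors $\mathbf{F}$ are standard normal and uncorrelated, the errors $\mathbf{\epsilon}$ are independent of $\mathbf{F}$ with a diagonal covariance $\mathbf{\Psi}$, so that $\mathbf{\Sigma} = \mathbf{\Lambda} \mathbf{\Lambda}^T + \mathbf{\Psi}$. The loadings, error variances and means below are made-up illustrative values (p = 3, k = 2), and all function names are hypothetical.

```python
import random

# Illustrative (made-up) parameters: p = 3 observed variables, k = 2 factors.
MU = [1.0, -2.0, 0.5]                              # mean vector mu
LOADINGS = [[0.9, 0.0], [0.5, 0.5], [0.0, 0.8]]    # Lambda (p x k)
PSI = [0.2, 0.3, 0.1]                              # diagonal of Psi (assumed)


def implied_covariance(loadings, psi):
    """Covariance implied by the model: Lambda Lambda^T + diag(psi)."""
    p, k = len(loadings), len(loadings[0])
    return [
        [
            sum(loadings[i][m] * loadings[j][m] for m in range(k))
            + (psi[i] if i == j else 0.0)
            for j in range(p)
        ]
        for i in range(p)
    ]


def simulate(mu, loadings, psi, n, seed=0):
    """Draw n samples of X = mu + Lambda F + epsilon.

    Assumes F ~ N(0, I) and epsilon_i ~ N(0, psi_i), all independent.
    """
    rng = random.Random(seed)
    p, k = len(loadings), len(loadings[0])
    samples = []
    for _ in range(n):
        f = [rng.gauss(0.0, 1.0) for _ in range(k)]  # unobserved factors
        x = [
            mu[i]
            + sum(loadings[i][m] * f[m] for m in range(k))
            + rng.gauss(0.0, psi[i] ** 0.5)          # error term
            for i in range(p)
        ]
        samples.append(x)
    return samples


def sample_covariance(samples):
    """Unbiased sample covariance matrix of a list of observations."""
    n, p = len(samples), len(samples[0])
    means = [sum(x[i] for x in samples) / n for i in range(p)]
    return [
        [
            sum((x[i] - means[i]) * (x[j] - means[j]) for x in samples) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]
```

Calling `sample_covariance(simulate(MU, LOADINGS, PSI, 20000))` and comparing it entry by entry with `implied_covariance(LOADINGS, PSI)` shows how only $\mathbf{X}$ is needed to see the covariance structure, even though $\mathbf{F}$ is never observed; this is the contrast with regression drawn in the last bullet.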