User:Timothee Flutre/Notebook/Postdoc/2012/03/13


About low-rank matrix decomposition

(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)


  • What is PCA? "The central idea of principal component analysis is to reduce the dimensionality of a data set in which there are a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This reduction is achieved by transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables." from the book "Principal component analysis" by I. Jolliffe.


  • Theoretical intuition:

Let X be the data matrix: an N x P matrix recording the measurements made on N samples (in rows) at P variables (in columns). This means that [math]\displaystyle{ X_{np} }[/math] corresponds to the measurement of the [math]\displaystyle{ p^{th} }[/math] variable on the [math]\displaystyle{ n^{th} }[/math] sample, with [math]\displaystyle{ n \in \{1,...,N\} }[/math] and [math]\displaystyle{ p \in \{1,...,P\} }[/math].

Concretely, what the description of I. Jolliffe above means is the following. By doing a PCA of X, we build P linear combinations of the original P variables. These linear combinations are called "principal components" (PCs). Most importantly, we hope that the first K PCs account for most of the total sample variance, with [math]\displaystyle{ K \ll P }[/math]. If this is the case, we can confidently approximate X by the first K columns of the transformed data matrix Y (defined just below), thereby discarding the remaining P-K components.

In matrix notation, doing the PCA of X boils down to applying a linear transformation to X. More precisely, this transformation is a rotation in the directions of maximum variance in the vector space of the P variables. As a result of the PCA, a new matrix Y is created such that [math]\displaystyle{ Y = X A }[/math], with Y an N x P matrix containing the rotated data, and A a P x P matrix whose columns are the principal components.
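
To make the bookkeeping concrete, here is a minimal sketch in R (the toy data are made up for illustration; prcomp, used in the R example further below, returns A in its "rotation" slot):

set.seed(1)
N <- 5; P <- 3
X <- matrix(rnorm(N * P), nrow=N, ncol=P)             # toy N x P data matrix
A <- prcomp(X, center=TRUE, scale.=FALSE)$rotation    # P x P, columns = PCs
Y <- scale(X, center=TRUE, scale=FALSE) %*% A         # N x P rotated data
dim(A)   # P x P
dim(Y)   # N x P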


  • Mathematical derivation:

First, we need to find PC1, i.e. the first column of the matrix A. According to what is written above (Y=XA), we can write down the N elements of the first column of Y:

[math]\displaystyle{ \begin{cases} Y_{11} = X_{11} A_{11} + X_{12} A_{21} + ... + X_{1P} A_{P1} \\ ... \\ Y_{n1} = X_{n1} A_{11} + X_{n2} A_{21} + ... + X_{nP} A_{P1} \\ ... \\ Y_{N1} = X_{N1} A_{11} + X_{N2} A_{21} + ... + X_{NP} A_{P1} \end{cases} }[/math]

Let's define the (column) vector [math]\displaystyle{ y_1 }[/math] as the first column of Y and the (column) vector [math]\displaystyle{ a_1 }[/math] as the first PC (first column of A). In matrix notation, we can re-write the system of equations above as [math]\displaystyle{ y_1 = X a_1 }[/math].

We therefore need to find [math]\displaystyle{ a_1 }[/math]. For this, remember that we want PC1 to account for most of the total variance of X. This translates mathematically into maximizing [math]\displaystyle{ V[y_1] }[/math], which in turn boils down to maximizing the variance of each element of [math]\displaystyle{ y_1 }[/math], i.e. [math]\displaystyle{ V[Y_{11}] }[/math], ..., [math]\displaystyle{ V[Y_{n1}] }[/math], ..., [math]\displaystyle{ V[Y_{N1}] }[/math].

Let's do it for [math]\displaystyle{ V[Y_{n1}] }[/math]. Define the vector [math]\displaystyle{ x_n }[/math] as the [math]\displaystyle{ n^{th} }[/math] row of X, written as a column vector (it's more common to manipulate column vectors). In matrix notation we then have [math]\displaystyle{ V[Y_{n1}] = V[a_1^T x_n] }[/math]. Now let [math]\displaystyle{ \Sigma_n }[/math] be the covariance matrix of [math]\displaystyle{ x_n }[/math]. This allows us to develop [math]\displaystyle{ V[a_1^T x_n] }[/math] into [math]\displaystyle{ a_1^T \Sigma_n a_1 }[/math].

As it stands, this maximization is ill-defined: we could make [math]\displaystyle{ a_1^T \Sigma_n a_1 }[/math] arbitrarily large simply by scaling [math]\displaystyle{ a_1 }[/math]. We therefore need to add a constraint, for instance: [math]\displaystyle{ \lVert a_1 \rVert ^2 = a_1^T a_1 = 1 }[/math] (squared Euclidean norm, inner product). Such a constraint can be enforced using a Lagrange multiplier [math]\displaystyle{ \lambda }[/math]. In the end, what we want is to find [math]\displaystyle{ a_1 }[/math] that maximizes [math]\displaystyle{ a_1^T \Sigma_n a_1 - \lambda (a_1^T a_1 - 1) }[/math].

For this, we differentiate the above formula with respect to [math]\displaystyle{ a_1 }[/math] and set it to 0: [math]\displaystyle{ \Sigma_n a_1 - \lambda a_1 = 0 \Leftrightarrow \Sigma_n a_1 = \lambda a_1 }[/math]. This shows us that [math]\displaystyle{ \lambda }[/math] is an eigenvalue of [math]\displaystyle{ \Sigma_n }[/math] and that [math]\displaystyle{ a_1 }[/math] is the corresponding eigenvector.

However, which eigenvector should we choose? The one corresponding to the largest eigenvalue, because the latter happens to be exactly the variance of each element of PC1. Indeed: [math]\displaystyle{ V[Y_{n1}] = V[a_1^T x_n] = a_1^T \Sigma_n a_1 = a_1^T \lambda a_1 = \lambda a_1^T a_1 = \lambda }[/math].
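
As a quick numeric check of this result (a sketch on simulated data, with the sample covariance matrix cov(X) standing in for [math]\displaystyle{ \Sigma_n }[/math]): the projection on the leading eigenvector has variance equal to the largest eigenvalue, and a random unit vector does worse.

set.seed(1)
N <- 1000
X <- cbind(rnorm(N, sd=2), rnorm(N, sd=1))   # toy data, 2 variables
evd <- eigen(cov(X))                         # EVD of the sample covariance matrix
a1 <- evd$vectors[,1]                        # candidate PC1 (unit norm)
var(X %*% a1)                                # equals evd$values[1] (up to numerical precision)
evd$values[1]
b <- rnorm(2); b <- b / sqrt(sum(b^2))       # a random unit vector
var(X %*% b)                                 # smaller projected variance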

Now that we have the first PC, we want the second one. We do the same as above, but with the additional constraint that PC2 has to be uncorrelated with PC1, i.e. [math]\displaystyle{ Cov[y_2, y_1] = 0 }[/math]. Here again, let's focus on the [math]\displaystyle{ n^{th} }[/math] element of the PCs and develop the covariance: [math]\displaystyle{ Cov[Y_{n2}, Y_{n1}] = Cov[a_2^T x_n, a_1^T x_n] = a_2^T \Sigma_n a_1 = \lambda_1 a_2^T a_1 }[/math]. Therefore, using two Lagrange multipliers, we now have to find [math]\displaystyle{ a_2 }[/math] that maximizes [math]\displaystyle{ a_2^T \Sigma_n a_2 - \lambda_2(a_2^T a_2 - 1) - \phi a_2^T a_1 }[/math]. Again, we differentiate w.r.t. [math]\displaystyle{ a_2 }[/math] and set the derivative to 0: [math]\displaystyle{ \Sigma_n a_2 - \lambda_2 a_2 - \phi a_1 = 0 }[/math]. When we multiply on the left by [math]\displaystyle{ a_1^T }[/math] we get: [math]\displaystyle{ a_1^T \Sigma_n a_2 - \lambda_2 a_1^T a_2 - \phi a_1^T a_1 = 0 }[/math]. The first two terms vanish (since [math]\displaystyle{ a_1^T \Sigma_n a_2 = \lambda_1 a_1^T a_2 }[/math] and the uncorrelatedness constraint imposes [math]\displaystyle{ a_1^T a_2 = 0 }[/math]), and [math]\displaystyle{ a_1^T a_1 = 1 }[/math], hence [math]\displaystyle{ \phi = 0 }[/math]. We can thus write [math]\displaystyle{ \Sigma_n a_2 = \lambda_2 a_2 }[/math] and find [math]\displaystyle{ a_2 }[/math] as the eigenvector corresponding to the second largest eigenvalue of [math]\displaystyle{ \Sigma_n }[/math].

By doing this again, we can find all the principal components.
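
In practice we don't repeat the Lagrangian argument P times: all the PCs are obtained at once as the eigenvectors of the sample covariance matrix, ordered by decreasing eigenvalue. A sketch (on made-up data) checking that this matches the rotation matrix returned by prcomp, keeping in mind that columns may differ by a sign flip:

set.seed(1)
X <- matrix(rnorm(200 * 4), ncol=4)      # toy data, 200 samples x 4 variables
A.evd <- eigen(cov(X))$vectors           # P x P, ordered by decreasing eigenvalue
A.prcomp <- prcomp(X)$rotation
all.equal(abs(A.evd), abs(A.prcomp), check.attributes=FALSE)   # should be TRUE (up to signs)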


  • EVD or SVD?

to do ...

Like any rectangular matrix, X can be factorized using the singular value decomposition (SVD): [math]\displaystyle{ X = U D V^T }[/math]

where U is an N x N orthogonal matrix, D an N x P rectangular diagonal matrix and V a P x P orthogonal matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X.

The transformed data are therefore [math]\displaystyle{ Y = X V }[/math].
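
A sketch of the link between the two routes (assuming X has been centered column-wise, and using made-up data): the right singular vectors V of X coincide, up to sign, with the eigenvectors of the sample covariance matrix, the eigenvalues being [math]\displaystyle{ d^2/(N-1) }[/math], and Y = XV reproduces the scores returned by prcomp.

set.seed(1)
N <- 100; P <- 3
X <- scale(matrix(rnorm(N * P), ncol=P), center=TRUE, scale=FALSE)   # centered toy data
s <- svd(X)                                  # X = U D V^T
e <- eigen(cov(X))                           # EVD of the covariance matrix
all.equal(s$d^2 / (N - 1), e$values)         # same (eigen)values
all.equal(abs(s$v), abs(e$vectors))          # same PCs, up to signs
Y <- X %*% s$v                               # transformed data
all.equal(abs(Y), abs(prcomp(X)$x), check.attributes=FALSE)   # same scores, up to signs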


  • Example in R:

First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples):

set.seed(1859)
N <- 100
P <- 2
exp.gene1 <- rnorm(N, mean=3)
exp.gene2 <- exp.gene1 + rnorm(N)
dat <- cbind(exp.gene1, exp.gene2)
plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data")

Apply PCA using the built-in "prcomp" function:

res.pca <- prcomp(dat, center=TRUE, scale.=FALSE)
summary(res.pca)
Importance of components:
                          PC1    PC2
Standard deviation     1.5848 0.6121
Proportion of Variance 0.8702 0.1298
Cumulative Proportion  0.8702 1.0000

Visualize the results:

plot(res.pca, main="Variances of each PC")
axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""),
                            paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")),
     tick=FALSE)
plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC scores")
mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4)
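
Since the entry is about low-rank decomposition, here is a sketch (reusing the dat and res.pca objects above) of the rank-1 approximation of the data obtained by keeping only PC1:

K <- 1                                   # number of PCs kept
approx.dat <- res.pca$x[, 1:K, drop=FALSE] %*% t(res.pca$rotation[, 1:K, drop=FALSE])
approx.dat <- sweep(approx.dat, 2, res.pca$center, "+")   # add the column means back
plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Rank-1 approximation (in red)")
points(approx.dat[,1], approx.dat[,2], col="red", pch=20)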

to continue ...


  • From the article "A Genealogical Interpretation of Principal Components Analysis" by G. McVean

n individuals genotyped at L loci

[math]\displaystyle{ Z_{si} }[/math]: allelic state of individual i at SNP s, [math]\displaystyle{ \in \{0,1\} }[/math]

Z: L x n binary matrix

Usually we center the data Z, here by subtracting the mean of the allelic states at each SNP (the mean of each row):

[math]\displaystyle{ X_{si} = Z_{si} - \frac{1}{n} \sum_{j=1}^n Z_{sj} }[/math]

It is also common to standardize the rows so that they have unit variance.
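
A sketch of this pre-processing on a small simulated genotype matrix (the 0/1 coding and the L x n layout follow the notation above; the simulation itself is made up for illustration, and monomorphic SNPs, whose rows have zero variance, would need to be removed before standardizing):

set.seed(1)
L <- 5; n <- 20
Z <- matrix(rbinom(L * n, size=1, prob=0.5), nrow=L, ncol=n)   # toy binary genotypes
X <- Z - rowMeans(Z)                     # X_{si} = Z_{si} - mean of row s
X.std <- X / apply(Z, 1, sd)             # optional: unit variance per SNP
rowMeans(X)                              # all ~ 0
apply(X.std, 1, var)                     # all ~ 1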

We want to approximate the data in X by a matrix Y such that [math]\displaystyle{ Y = P X }[/math].