User:Matthew Whiteside/Notebook/Fumigatus Microarray/2009/05/08

Project name

Distance & Clustering

I could not get Matisse to run with Pearson or any other distance metric except Spearman. Spearman is a non-parametric (rank-based) correlation metric.

Here are some notes on distances:

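Pearson r is equal to the cosine of the angle between the z-score transformed data vectors (i.e. their dot product divided by n-1).

Pearson: ρ(X,Y)
ρ(X,Y) = [ Σ_{i=1..n} z(x_i)·z(y_i) ] ÷ (n-1)

Z-score: z(x_i) = (x_i - μ_x)/σ_x (mean-centered data, normalized by the standard deviation σ_x; with the n-1 divisor above, σ_x is the sample standard deviation).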
Pearson assumes normality & is affected by outliers.
Spearman does not require normality (has some assumptions about symmetry in the distribution).
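
A minimal Python sketch of the Pearson formula above, to check that the z-score form and the cosine form agree. The function names and the toy data are made up for illustration; the standard deviation is the sample (n-1) version, which is an assumption needed to match the n-1 divisor.

```python
import math
import statistics

def zscores(values):
    """Mean-center and divide by the sample standard deviation (n-1 divisor)."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

def pearson_from_zscores(x, y):
    """Pearson r as the sum of z-score products divided by (n-1)."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def cosine_of_centered(x, y):
    """Cosine of the angle between the mean-centered vectors."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cx = [v - mx for v in x]
    cy = [v - my for v in y]
    dot = sum(a * b for a, b in zip(cx, cy))
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    return dot / (norm_x * norm_y)

# Toy data (hypothetical, not from the microarray set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r_z = pearson_from_zscores(x, y)
r_cos = cosine_of_centered(x, y)
print(r_z, r_cos)  # the two values should agree up to rounding error
```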

Spearman r_s (no tied ranks)
T = Σ_{i=1..n} (rank(x_i) - rank(y_i))²
r_s = 1 - 6T ÷ (n³ - n)
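
The Spearman formula can be sketched in Python as follows. This assumes there are no tied values (the shortcut formula is only exact in that case), and the helper names are my own.

```python
def ranks(values):
    """Return ranks 1..n for each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_no_ties(x, y):
    """r_s = 1 - 6T / (n^3 - n), with T the sum of squared rank differences."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    t = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * t / (n ** 3 - n)

x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 25]  # non-linear but strictly increasing

rs_up = spearman_no_ties(x, y)                    # identical ordering
rs_down = spearman_no_ties(x, list(reversed(y)))  # fully reversed ordering
rs_mixed = spearman_no_ties([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rs_up, rs_down, rs_mixed)
```

Identical orderings give T = 0 and r_s = 1; a fully reversed ordering gives r_s = -1.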

Results from Pearson's r & ordinal correlation coefficients (like Spearman's rho & Kendall's tau) are very similar when 2 conditions are met: there are no outliers & there is no pronounced non-linear relationship.
Spearman & Pearson are most similar for ~linear relationships. This is because ranking interval data linearizes a monotonic relationship, i.e. an exponentially increasing trend becomes a simple increasing trend once ranked (the steep end of the exponential is flattened). For such non-linear but monotonic data, Pearson's & Spearman's coefficients can be very different.
This can be used as follows:

If Pearson's r is much smaller than Spearman's rho applied to the same variables, we can conclude that the variables ARE consistently correlated,
but NOT in a linear fashion. When both correlation coefficients yield very similar values different from zero, there is an indication of a linear
relationship.
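
To see this rule of thumb on toy data, here is a small Python sketch comparing the two coefficients on a linear and an exponential trend. The data are invented, and Spearman is computed here as Pearson applied to the ranks (valid without ties); none of this is Matisse's implementation.

```python
import math

def pearson(x, y):
    """Pearson r from sums of centered products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rho as Pearson r of the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y_lin = [3 * v + 2 for v in x]     # linear trend
y_exp = [math.exp(v) for v in x]   # monotonic but strongly non-linear

print("linear:     ", pearson(x, y_lin), spearman(x, y_lin))
print("exponential:", pearson(x, y_exp), spearman(x, y_exp))
```

On the exponential trend Spearman stays at 1 while Pearson drops well below it, which is the "consistently correlated but not linear" case.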

Outliers: Spearman is more robust to outliers. Case in point: an outlier value can increase without changing its rank. This won't change Spearman's rho, but will change Pearson's r.
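
A quick Python check of the outlier point, on invented numbers: pushing the largest value further out leaves the ranks (and therefore Spearman) untouched, while Pearson drops.

```python
import math

def pearson(x, y):
    """Pearson r from sums of centered products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 8.0]
y_outlier = y[:-1] + [80.0]  # last point becomes an extreme outlier

same_ranks = ranks(y) == ranks(y_outlier)  # the outlier keeps its rank
r_clean = pearson(x, y)
r_outlier = pearson(x, y_outlier)
print(same_ranks, r_clean, r_outlier)
```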
Spearman's is quasi-ordinal. Kendall's tau makes even fewer assumptions about the data (namely, it does not assume that the distribution is ~symmetric).
Kendall's tau is computed from the difference C (concordant) - D (discordant) pairs between two rows. For example, for 10 datapoints in each row, there are 45 ( = 10*(10-1)/2) possible pairs of points (every possible combo). A pair of points i & j is concordant if row A & row B order them the same way (A_i > A_j & B_i > B_j, or A_i < A_j & B_i < B_j), and discordant if they order them in opposite ways. The difference C - D is then normalized by the number of pairs: tau = (C - D) ÷ (n(n-1)/2).
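
The pair counting can be sketched in Python like this. It is the simple version normalized by the total number of pairs (often called tau-a); tied pairs count as neither concordant nor discordant, and the function name and data are my own.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Return (tau, concordant, discordant, n_pairs) for two equal-length rows."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # Positive product: both rows order points i and j the same way
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    tau = (concordant - discordant) / n_pairs
    return tau, concordant, discordant, n_pairs

row_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
row_b = [1, 2, 3, 4, 5, 6, 7, 8, 10, 9]  # only the last two points are swapped

tau, c, d, n_pairs = kendall_tau(row_a, row_b)
print(tau, c, d, n_pairs)  # 45 pairs in total: 44 concordant, 1 discordant
```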
Significance testing of correlation coefficients (for 2 variables): t-test with n-2 degrees of freedom.
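
A small Python sketch of that test, assuming the usual statistic t = r·sqrt((n-2)/(1-r²)). The r and n values are invented, and the critical value 2.306 is the two-sided 5% value for 8 degrees of freedom taken from a standard t table (the stdlib has no t distribution, so no p-value is computed here).

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: no correlation, with n-2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 data points")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical example: r = 0.8 observed over n = 10 conditions
r, n = 0.8, 10
t = correlation_t_statistic(r, n)

T_CRIT_DF8 = 2.306  # two-sided, alpha = 0.05, df = 8 (from a t table)
significant = abs(t) > T_CRIT_DF8
print(t, significant)
```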

