User:Matthew Whiteside/Notebook/Fumigatus Microarray/2009/05/08

From OpenWetWare

Distance & Clustering

I could not run Matisse with Pearson, or with any distance metric other than Spearman. Spearman is a non-parametric (rank-based) correlation metric.

Here are some notes on distances:

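  • Pearson r is equal to the vector dot product of the z-score transformed data divided by n-1 (equivalently, the cosine of the angle between the mean-centered vectors).
Pearson: ρ(X,Y)
$\rho_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} z(x_i)\, z(y_i)$
  • Z-score: $z(x_i) = (x_i - \mu_x)/\sigma_x$ (mean-centered, standard-deviation-normalized data; $\sigma_x$ is the sample standard deviation, computed with n-1 in the denominator).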
  • Pearson assumes normality & is affected by outliers.
  • Spearman does not require normality (has some assumptions about symmetry in the distribution).
Spearman r
$T = \sum_{i=1}^{n} \left(\mathrm{rank}(X_i) - \mathrm{rank}(Y_i)\right)^2$
$r_s = 1 - \frac{6T}{n^3 - n}$ (exact only when there are no tied ranks)
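To check the two formulas above against each other, here is a minimal Python sketch using only the standard library. The function names (`z_scores`, `pearson_r`, `ranks`, `spearman_rs`) are my own, not anything from Matisse, and the ranking helper assumes there are no tied values. Spearman uses the sum of squared rank differences.

```python
import math

def z_scores(values):
    """Mean-center and divide by the sample standard deviation (n-1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    """Pearson r as the dot product of z-scores divided by n-1."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def ranks(values):
    """Ranks 1..n, smallest value first. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rs(x, y):
    """Spearman r_s from the sum of squared rank differences T (no ties)."""
    n = len(x)
    t = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * t / (n ** 3 - n)
```

With no ties, `spearman_rs(x, y)` should agree with `pearson_r(ranks(x), ranks(y))`, i.e. Spearman is Pearson applied to the ranks.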
  • Results from Pearson's & ordinal correlation coefficients (like Spearman's & Kendall's tau) are very similar when 2 conditions are met: there are no outliers & there is no pronounced non-linear relationship.
  • Spearman & Pearson are most similar for ~linear relationships. This is because ranking interval data linearizes any monotonic relationship, e.g. an exponentially increasing trend becomes a simple increasing trend once the values are replaced by their ranks (the steep end of the exponential is flattened). For such a non-linear trend, Pearson's & Spearman's can be very different.
  • This can be used as follows:
If Pearson's r is much smaller than Spearman's rho applied to the same variables, we can conclude that the variables ARE consistently correlated,
but NOT in a linear fashion. When both correlation coefficients yield very similar values different from zero, there is indication of a linear
relationship.
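A quick way to see the rule quoted above is to compare the two coefficients on a linear and on an exponential trend. This is only an illustrative sketch: the data are made up, the helper names are mine, and the ranking assumes no ties.

```python
import math

def pearson(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank(values):
    """Ranks 1..n, smallest value first. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for r, idx in enumerate(order, start=1):
        out[idx] = r
    return out

def spearman(x, y):
    """Spearman rho computed as Pearson r on the ranks."""
    return pearson(rank(x), rank(y))

x = list(range(1, 11))
y_lin = [2 * v + 1 for v in x]      # linear trend (made-up data)
y_exp = [math.exp(v) for v in x]    # monotonic but strongly non-linear

for name, y in (("linear", y_lin), ("exponential", y_exp)):
    print(f"{name:12s} pearson={pearson(x, y):.3f} spearman={spearman(x, y):.3f}")
```

For the linear trend both coefficients are 1; for the exponential trend Spearman stays at 1 while Pearson drops well below it, which is the "consistently correlated, but not linearly" case.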
  • Outliers: Spearman is more robust to outliers. Case in point: an outlier value can increase without changing its rank. This won't change Spearman's, but will change Pearson's.
  • Spearman's is quasi-ordinal. Kendall's tau makes even fewer assumptions about the data (namely, it does not assume the distribution is ~symmetric).
  • Kendall's tau is computed from the difference between the number of concordant pairs (C) & discordant pairs (D) between two rows. For example, for 10 datapoints in each row, there are 45 (= 10*(10-1)/2) possible pairs of points (every possible combo). A pair of points (i, j) is concordant when row A & row B order the two points the same way (both rows go up, or both go down, from point i to point j), & discordant when they order them in opposite ways. The difference is then normalized by the number of pairs: tau = (C - D) / (n(n-1)/2).
  • Significance testing of correlation coefficients (for 2 variables): t-test with n-2 degrees of freedom.
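
The pair counting for Kendall's tau and the t-test from the last two bullets can be sketched as follows. Assumptions on my part: this is the simple tau-a variant (tied pairs count as neither concordant nor discordant), and the test statistic is the usual t = r·sqrt((n-2)/(1-r²)), which is then compared against a t distribution with n-2 degrees of freedom. The function names are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # Same sign: both rows order points i and j the same way.
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

def t_statistic(r, n):
    """t statistic for a correlation r from n points (n-2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

row_a = [1, 2, 3, 4]
row_b = [1, 3, 2, 4]   # one swapped pair out of 6
print(kendall_tau(row_a, row_b))
print(t_statistic(0.5, 10))
```

The p-value lookup is left out because the standard library has no t-distribution CDF; a statistics package or a t table would be needed for that step.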

references: [Pearson & Spearman] [more] [even more]