Distance & Clustering
I could not run Matisse with Pearson or any distance metric other than Spearman. Spearman is a non-parametric correlation metric.
Here are some notes on distances:
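- Pearson's r is equal to the cosine of the angle between the mean-centered data vectors; equivalently, it is the dot product of the z-score transformed data divided by n - 1.
$\rho_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} z(x_i)\, z(y_i)$
- Z-score: $z(x_i) = (x_i - \mu_x) / \sigma_x$ (mean-centered, standard-deviation-normalized data; with the n - 1 divisor above, $\sigma_x$ is the sample standard deviation).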
- Pearson assumes normality & is affected by outliers.
- Spearman does not require normality (has some assumptions about symmetry in the distribution).
$T = \sum_{i=1}^{n} \left(\mathrm{rank}(X_i) - \mathrm{rank}(Y_i)\right)^2$
$r_s = 1 - \frac{6T}{n^3 - n}$
- Results from Pearson's and ordinal correlation coefficients (like Spearman's rho & Kendall's tau) are very similar when 2 conditions are met: there are no outliers & there is no pronounced non-linear relationship.
- Spearman & Pearson are most similar for ~linear relationships. This is because ranking interval data linearizes any monotonic relationship, i.e. an exponentially increasing trend becomes a simple increasing trend once ranked (flattening the steep end of the exponential). For such monotonic but non-linear data, Pearson's & Spearman's can be very different.
- This can be used as follows:
If Pearson's r is much smaller than Spearman's rho applied to the same variables, we can conclude that the variables ARE consistently correlated,
but NOT in a linear fashion. When both correlation coefficients yield very similar values different from zero, there is indication of a linear relationship.
- Outliers: Spearman is more robust to outliers. Case in point: an outlier value can increase without changing its rank. This won't change Spearman's, but will change Pearson's.
- Spearman's is quasi-ordinal. Kendall's tau makes even fewer assumptions about the data (namely, it does not assume the data are ~symmetric).
- Kendall's tau is computed from the difference between the numbers of C (concordant) and D (discordant) pairs between two rows. For example, for 10 data points in each row, there are 45 (= 10*(10-1)/2) possible pairs of points (every possible combination). A pair of points (i, j) is concordant if both rows order it the same way (row A's point i is above its point j and row B's point i is also above its point j, or both are below), and discordant if the rows order it in opposite ways. The difference C - D is then normalized by the total number of pairs.
- Significance testing of correlation coefficients (for 2 variables): t-test with n-2 degrees of freedom.
[Pearson & Spearman]
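
To make the Pearson and Spearman formulas in these notes concrete, here is a minimal pure-Python sketch. The function names (zscores, pearson_r, ranks, spearman_rho) and the toy data are my own, not anything from Matisse, and the rank-difference shortcut assumes there are no tied values. It also illustrates two of the points above: a monotonic but non-linear trend gives Spearman = 1 with a noticeably lower Pearson, and inflating an outlier changes Pearson but not Spearman.

```python
import math

def zscores(values):
    """Mean-center and divide by the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    """Pearson r: dot product of the z-scores divided by n - 1."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def ranks(values):
    """Ranks 1..n. Assumes no ties (ties would need average ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman rho from squared rank differences (no-ties formula)."""
    n = len(x)
    t = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * t / (n ** 3 - n)

x = [1, 2, 3, 4, 5, 6]

# Monotonic but strongly non-linear: ranks agree perfectly, raw values do not.
y_exp = [math.exp(v) for v in x]
print("exponential:", pearson_r(x, y_exp), spearman_rho(x, y_exp))

# Outlier: raising the last value keeps its rank, so only Pearson moves.
y_out = [1, 2, 3, 4, 5, 100]
print("outlier:    ", pearson_r(x, y_out), spearman_rho(x, y_out))

# Spearman is Pearson applied to the ranks (when there are no ties).
y_mix = [2, 1, 4, 3, 6, 5]
print("on ranks:   ", pearson_r(ranks(x), ranks(y_mix)), spearman_rho(x, y_mix))
```

For real data with ties, a library routine that uses average ranks would be the safer choice than this shortcut formula.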
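
The pair counting behind Kendall's tau can be sketched the same way. This is the simplest variant (often called tau-a), with no correction for ties; the function name kendall_tau and the example rows are hypothetical.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau-a: (C - D) divided by the number of pairs, n*(n-1)/2."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # Positive product: both rows order points i and j the same way.
        direction = (a[i] - a[j]) * (b[i] - b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 is a tie; it counts toward neither C nor D.
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

row_a = list(range(10))                   # 10 points -> 45 pairs
row_b = [1, 0, 2, 3, 4, 5, 6, 7, 8, 9]    # only the first two points swapped
print(kendall_tau(row_a, row_b))          # one discordant pair out of 45
```

With one swapped pair there are 44 concordant and 1 discordant pairs, so tau is (44 - 1) / 45.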
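
For the significance test, the usual statistic is t = r * sqrt((n - 2) / (1 - r^2)), compared against a t distribution with n - 2 degrees of freedom. A small sketch follows; the function name and the example values of r and n are made up for illustration, and for Spearman this t-test is only an approximation.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: zero correlation, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.6, 27          # hypothetical correlation and sample size
t = correlation_t_statistic(r, n)
df = n - 2
print(t, df)
```

The p-value then comes from the t distribution with df degrees of freedom, for example from a t table or, if SciPy is available, 2 * scipy.stats.t.sf(abs(t), df) for a two-sided test.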