User:Matthew Whiteside/Notebook/Fumigatus Microarray/2009/05/08

{| width="800"
 * style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]] Project name
 * style="background-color: #F2F2F2" align="center"|  |Main project page


 * colspan="2"|

Distance & Clustering
I could not run Matisse with Pearson, or with any distance metric other than Spearman. Spearman is a non-parametric, rank-based correlation metric.

Here are some notes on distances:

Pearson: <math>\rho_{X,Y} = \frac{\sum_{i=1}^{n} Z(X)_i \, Z(Y)_i}{n-1}</math>

Spearman: <math>T = \sum_{i=1}^{n} \left(\mathrm{rank}(X_i) - \mathrm{rank}(Y_i)\right)^2</math>, <math>r_s = 1 - \frac{6T}{n^3 - n}</math>

If Pearson's r is much smaller than Spearman's rho (r<sub>s</sub>) applied to the same variables, we can conclude that the variables are consistently correlated, but not in a linear fashion. When both correlation coefficients yield very similar values different from zero, this indicates a linear relationship.
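To make the two definitions concrete, here is a minimal Python sketch that computes both coefficients directly from the formulas in these notes. The function names and the exponential test data are my own illustration and are not part of Matisse; the rank helper assumes there are no tied values (ties would need average ranks).

```python
import math

def zscores(values):
    """Mean-center and scale by the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson(x, y):
    """Pearson r: sum of z-score products divided by n - 1."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def ranks(values):
    """Ranks 1..n. Assumes no ties (an assumption of this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman r_s = 1 - 6T / (n^3 - n), T = sum of squared rank differences."""
    n = len(x)
    t = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * t / (n ** 3 - n)

# Hypothetical data: y grows exponentially with x (monotonic but non-linear).
x = list(range(1, 11))
y = [math.exp(v) for v in x]

print(f"Pearson  = {pearson(x, y):.3f}")
print(f"Spearman = {spearman(x, y):.3f}")
```

On this made-up exponential trend the ranks of x and y agree exactly, so Spearman is 1 while Pearson is noticeably lower. That is the "consistently correlated, but not linear" case.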
 * Pearson r is equal to the cosine of the angle between the two mean-centered data vectors; equivalently, it is the dot product of the z-score transformed vectors divided by n-1.
 * Z-score: z(xi) = (xi – μx)/σx (mean-centered, standard-deviation-normalized data).
 * Pearson assumes normality & is affected by outliers.
 * Spearman does not require normality (though it makes some assumptions about symmetry in the distribution).
 * Results from Pearson's & ordinal correlation coefficients (like Spearman & Kendall's tau) are very similar when 2 conditions are met: there are no outliers & there is no pronounced non-linear relationship.
 * Spearman & Pearson are most similar for ~linear relationships. This is because ranking interval data linearizes the relationship, i.e. an exponentially increasing trend becomes a simple increasing trend once ranked (flattening the end of the exponential trend). For such monotonic but non-linear trends, Pearson's & Spearman's can be very different.
 * This difference can be used as a diagnostic, as described in the note on Pearson's r versus Spearman's rho above.
 * Outliers: Spearman is more robust to outliers. Case in point: an outlier value can increase without changing its rank. This won't change Spearman's, but will change Pearson's.
 * Spearman's is quasi-ordinal. Kendall's tau makes even fewer assumptions about the data (namely, it does not assume the data are ~symmetric).
 * Kendall's tau is computed from the difference between the numbers of concordant (C) and discordant (D) pairs of points in two rows. For example, with 10 data points in each row there are 45 ( = 10*(10-1)/2) possible pairs of points (every possible combination). A pair of points, e.g. points 1 & 2, is concordant if row A and row B change in the same direction between them (both rise or both fall), and discordant if they change in opposite directions. C - D is then normalized by the total number of pairs.
 * Significance testing of correlation coefficients (for 2 variables): t-test with n-2 degrees of freedom.
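
The pair counting for Kendall's tau and the t statistic for the significance test can be sketched as below. This is a minimal illustration under my own assumptions: it computes the simplest variant (tau-a, where tied pairs count as neither concordant nor discordant), the row values are made up, and the t statistic uses the standard form t = r·√((n-2)/(1-r²)). Turning t into a p-value needs a t distribution with n-2 degrees of freedom, which is not shown here.

```python
import math
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (C - D) divided by the number of pairs n(n-1)/2."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # Positive product: both rows move in the same direction between i and j.
        direction = (a[i] - a[j]) * (b[i] - b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 is a tie and is counted as neither.
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

def correlation_t_statistic(r, n):
    """t statistic for a correlation r from n points (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical rows of 10 points each: row_b swaps every adjacent pair of row_a.
row_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
row_b = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

tau = kendall_tau(row_a, row_b)
print(f"pairs = {len(row_a) * (len(row_a) - 1) // 2}, tau = {tau:.3f}")

# Hypothetical correlation of 0.6 measured on 10 points.
print(f"t = {correlation_t_statistic(0.6, 10):.3f}")
```

With these rows, 5 of the 45 pairs are discordant (the swapped neighbours) and 40 are concordant, giving tau = (40 - 5)/45.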

references: [Pearson & Spearman] [more] [even more]


 * }