User:Timothee Flutre/Notebook/Postdoc/2011/11/04
From OpenWetWare
K-means has the bad tendency to build clusters of similar size
    for k in {1..8}; do
      vcluster matrix.txt $k -clabelfile=colnames.txt -plotclusters=plot_k${k}.ps -clustercolumns > stdout_k${k}
    done

It works well and still finishes on a large, "real" dataset. However, should I trust the results? It is well known that k-means tends to build clusters of similar size and, as shown by the figure below, it can give bad results.
First, let's simulate some data in three dimensions with four clusters of very unequal sizes (700, 200, 50 and 50 points):

    low.mean <- 0
    high.mean <- 2
    mysd <- 0.1
    mult <- 1000
    # one block of rows per configuration; in the "truth" label,
    # "0" means low mean and "1" means high mean for columns F, L, T
    mydata.all <- rbind(matrix(rnorm(0.7*mult*3, mean=low.mean, sd=mysd), ncol=3, byrow=TRUE),
                        matrix(rnorm(0.2*mult*3, mean=high.mean, sd=mysd), ncol=3, byrow=TRUE),
                        matrix(c(rnorm(0.05*mult, mean=low.mean, sd=mysd),
                                 rnorm(0.1*mult, mean=high.mean, sd=mysd)), ncol=3, byrow=FALSE),
                        matrix(c(rnorm(0.05*mult, mean=high.mean, sd=mysd),
                                 rnorm(0.1*mult, mean=low.mean, sd=mysd)), ncol=3, byrow=FALSE))
    mydata.all <- cbind(mydata.all, c(rep("000", 0.7*mult), rep("111", 0.2*mult),
                                      rep("011", 0.05*mult), rep("100", 0.05*mult)))
    colnames(mydata.all) <- c("F", "L", "T", "truth")
    head(mydata.all)

Now, let's use kmeans and plot the results:

    # shuffle the rows and convert back to numeric (cbind with the labels made everything character)
    mydata <- matrix(as.numeric(mydata.all[sample(nrow(mydata.all)), 1:3]), ncol=3, byrow=FALSE)
    colnames(mydata) <- c("F","L","T")
    head(mydata)
    res.km <- kmeans(mydata, 4)
    aggregate(mydata, by=list(res.km$cluster), FUN=mean)  # mean of each variable per cluster
    table(res.km$cluster)                                 # cluster sizes
    library(scatterplot3d)
    scatterplot3d(mydata[,"F"], mydata[,"L"], mydata[,"T"], color=res.km$cluster, main="kmeans")

It's pretty wrong, isn't it?

And as a bonus, here is how to plot the corresponding heatmap (it took me some time to find the proper way to do it):

    # sort the rows by cluster so that each cluster forms a contiguous band
    mydata.sort <- cbind(mydata, res.km$cluster)[order(res.km$cluster),]
    heatmap(mydata.sort[,1:3], Rowv=NA, Colv=NA, labRow=NA, scale="none", col=heat.colors(10))
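In the code above the "truth" column is dropped when the rows are shuffled, so "pretty wrong" can only be judged by eye. The sketch below is one possible way to quantify it: it rebuilds the same kind of simulated data, shuffles the data and the labels with the same permutation, and cross-tabulates the true configuration against the k-means cluster. The helper `simul.group`, the vector `sizes`, the object `confusion` and the seed are my own additions for illustration, not part of the original analysis.

```r
# Sketch: keep the simulated labels aligned with the data so that the
# k-means partition can be cross-tabulated against the truth.
set.seed(1859)  # arbitrary seed, only to make the run reproducible

low.mean <- 0
high.mean <- 2
mysd <- 0.1

# Number of points per configuration (same proportions as above, mult = 1000).
sizes <- c("000"=700, "111"=200, "011"=50, "100"=50)

# Hypothetical helper: simulate n points for one 0/1 pattern,
# where "0" means low.mean and "1" means high.mean for that column.
simul.group <- function(pattern, n) {
  means <- ifelse(strsplit(pattern, "")[[1]] == "1", high.mean, low.mean)
  sapply(means, function(m) rnorm(n, mean=m, sd=mysd))  # n x 3 matrix
}

mydata <- do.call(rbind, unname(Map(simul.group, names(sizes), sizes)))
colnames(mydata) <- c("F", "L", "T")
truth <- rep(names(sizes), times=sizes)

# Shuffle the rows and the labels with the same permutation.
idx <- sample(nrow(mydata))
mydata <- mydata[idx, ]
truth <- truth[idx]

# Same call as above: 4 centers, default single random start.
res.km <- kmeans(mydata, 4)

# Rows: true configuration; columns: k-means cluster.
confusion <- table(truth, cluster=res.km$cluster)
print(confusion)
```

A true configuration spread over several columns has been split, and a column containing several configurations has merged them. Since kmeans uses a single random start by default (nstart=1), the table can change from one run to the next; raising nstart might be one way to check whether a poor partition comes from an unlucky start.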