User:Timothee Flutre/Notebook/Postdoc/2011/11/04
From OpenWetWare
K-means has the bad tendency to build clusters of similar size
for k in {1..8}; do vcluster matrix.txt $k -clabelfile=colnames.txt -plotclusters=plot_k${k}.ps -clustercolumns > stdout_k${k}; done
It works well and still finishes on a large, "real" dataset. However, should I trust the results? It is well known that kmeans tends to build clusters of similar size. And, as shown by the figure below, it can give bad results... Let's first simulate 1000 points in 3 dimensions, coming from 4 groups of very unequal sizes (700, 200, 50 and 50):
low.mean <- 0
high.mean <- 2
mysd <- 0.1
mult <- 1000
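# 700 points with all three coordinates low, then 200 with all three high
mydata.all <- rbind(matrix(rnorm(0.7*mult*3, mean=low.mean, sd=mysd), ncol=3, byrow=TRUE),
                    matrix(rnorm(0.2*mult*3, mean=high.mean, sd=mysd), ncol=3, byrow=TRUE),
                    # 50 points low in the first column and high in the other two
                    matrix(c(rnorm(0.05*mult, mean=low.mean, sd=mysd), rnorm(0.1*mult, mean=high.mean, sd=mysd)), ncol=3, byrow=FALSE),
                    # 50 points high in the first column and low in the other two
                    matrix(c(rnorm(0.05*mult, mean=high.mean, sd=mysd), rnorm(0.1*mult, mean=low.mean, sd=mysd)), ncol=3, byrow=FALSE))
# append the true group label (this coerces the whole matrix to character)
mydata.all <- cbind(mydata.all, c(rep("000", 0.7*mult), rep("111", 0.2*mult), rep("011", 0.05*mult), rep("100", 0.05*mult)))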
colnames(mydata.all) <- c("F", "L", "T", "truth")
head(mydata.all)
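Binding the label column to the numeric matrix with cbind coerces every value to character, which is why the coordinates have to go through as.numeric before clustering. Here is a minimal sketch of an alternative, assuming a data frame is acceptable here. The names `sim.group` and `mydata.df` are hypothetical and not part of the original notebook, and the seed is arbitrary. It keeps the coordinates numeric and makes it easy to check the simulated group sizes and means:

```r
set.seed(1)  # arbitrary seed, only so that the sketch is reproducible

# Hypothetical helper: n points whose three coordinates have the given means
sim.group <- function(n, means, sd=0.1, label) {
  coords <- sapply(means, function(m) rnorm(n, mean=m, sd=sd))
  data.frame(F=coords[,1], L=coords[,2], T=coords[,3], truth=label,
             stringsAsFactors=FALSE)
}

# Same design as the matrix version: sizes 700/200/50/50
mydata.df <- rbind(sim.group(700, c(0,0,0), label="000"),
                   sim.group(200, c(2,2,2), label="111"),
                   sim.group(50,  c(0,2,2), label="011"),
                   sim.group(50,  c(2,0,0), label="100"))

# Number of points per true group
table(mydata.df$truth)

# Mean of each coordinate per true group (should be close to 0 or 2)
aggregate(cbind(F, L, T) ~ truth, data=mydata.df, FUN=mean)
```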
Now, let's use kmeans and plot the results:
mydata <- matrix(as.numeric(mydata.all[sample(nrow(mydata.all)), 1:3]), ncol=3, byrow=FALSE)
colnames(mydata) <- c("F","L","T")
head(mydata)
res.km <- kmeans(mydata, 4)
aggregate(mydata, by=list(res.km$cluster), FUN=mean)
table(res.km$cluster)
library(scatterplot3d)
scatterplot3d(mydata[,"F"], mydata[,"L"], mydata[,"T"], color=res.km$cluster, main="kmeans")
It's pretty wrong, isn't it? And as a bonus, here is how to plot the corresponding heatmap (as it took me some time to find the proper way to do it):
mydata.sort <- cbind(mydata, res.km$cluster)[order(res.km$cluster),]
heatmap(mydata.sort[,1:3], Rowv=NA, Colv=NA, labRow=NA, scale="none", col=heat.colors(10))
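To put numbers on how wrong the partition is, one option is to cross-tabulate the kmeans clusters against the simulated truth. The sketch below is not part of the original analysis. It regenerates data with the same design and keeps the shuffling permutation in a variable (called `idx` here, a name introduced for illustration) so that the true labels stay aligned with the rows. It then compares a single random start with the best of several starts, using the `nstart` argument of `kmeans`. The seed is arbitrary and the tables will change with it.

```r
set.seed(1)  # arbitrary seed, only so that the sketch is reproducible

# Same design as above: 4 groups of sizes 700/200/50/50 in 3 dimensions
sizes <- c("000"=700, "111"=200, "011"=50, "100"=50)
means <- list("000"=c(0,0,0), "111"=c(2,2,2), "011"=c(0,2,2), "100"=c(2,0,0))
coords <- do.call(rbind, lapply(names(sizes), function(g)
  sapply(means[[g]], function(m) rnorm(sizes[[g]], mean=m, sd=0.1))))
colnames(coords) <- c("F", "L", "T")
truth <- rep(names(sizes), times=sizes)

# Shuffle the rows, keeping the permutation so the labels follow the data
idx <- sample(nrow(coords))
mydata <- coords[idx, ]
truth <- truth[idx]

# One random start versus the best of 20 random starts
res.km1 <- kmeans(mydata, centers=4, nstart=1)
res.km20 <- kmeans(mydata, centers=4, nstart=20)

# Rows are true groups, columns are kmeans clusters; a perfect partition
# has exactly one non-zero cell per row and per column
table(truth, cluster=res.km1$cluster)
table(truth, cluster=res.km20$cluster)

# Total within-cluster sum of squares: the criterion kmeans tries to minimize
c(nstart1=res.km1$tot.withinss, nstart20=res.km20$tot.withinss)
```

If both tables show a true group split across clusters, the problem is more likely a limitation of kmeans on this kind of data. If only the single-start run does, an unlucky initialization is the more plausible culprit.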



