I'm using the clustering technique given here, which comes from the Mahout examples, to cluster a large dataset. However, when I visualize the resulting clustering I get the following figure.
I'm really struggling to understand what this actually means and have several questions.
k-means is not the most advanced clustering technique. Circles are a misleading way to visualize it: k-means actually partitions the data space into Voronoi cells (look it up on Wikipedia). It also prefers similar-sized clusters.
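To make the Voronoi point concrete, here is a minimal sketch in plain Python (not Mahout code). The centroids and the helper name `nearest_centroid` are made up for illustration. Each point simply belongs to whichever centroid is closest, so the boundary between two clusters is the straight line halfway between their centroids, not a circle.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Hypothetical 2-D centroids, chosen only for illustration.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

# The boundary between centroid 0 and centroid 1 is the line x = 5.
print(nearest_centroid((4.9, 0.0), centroids))  # → 0
print(nearest_centroid((5.1, 0.0), centroids))  # → 1
```

A point just left of x = 5 lands in cell 0 and a point just right of it lands in cell 1, no matter how far either is from its centroid.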
I assume that the different colors indicate the different iterations of k-means. It requires several iterations to optimize its result (which usually only reaches a local minimum, and different runs will produce different results). So the results aren't very stable yet, I guess. They shift only slowly, which is why they don't overlap much.
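If it helps, this is roughly what those iterations look like. It is a sketch of the standard Lloyd iteration in plain Python, assuming random initial centroids drawn from the data. It is not Mahout's implementation, and the names `kmeans`, `seed`, `max_iter` and the synthetic data are my own. The `history` list records the centroids after each update, which is presumably what the colors in your plot correspond to.

```python
import math
import random

def kmeans(points, k, seed, max_iter=100):
    """Plain Lloyd iteration; returns (centroids, sse, centroid history)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids taken from the data
    history = [list(centroids)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # empty cluster: leave centroid in place
        if new_centroids == centroids:
            break  # nothing moved, so the result is a (local) optimum
        centroids = new_centroids
        history.append(centroids)
    # Sum of squared distances from each point to its nearest centroid.
    sse = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, sse, history

# Synthetic data: three Gaussian blobs (an assumption for this demo).
data_rng = random.Random(0)
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (6, 0), (3, 5)] for _ in range(50)]

# Different seeds give different starting centroids and possibly different results.
for seed in range(3):
    centroids, sse, history = kmeans(points, k=3, seed=seed)
    print(f"seed={seed}: {len(history) - 1} centroid updates, SSE={sse:.1f}")
```

Each update can only lower (or keep) the sum of squared errors, which is why the procedure converges, but where it ends up depends on where it started.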
The number of clusters is a parameter for k-means. It's commonly denoted as k. k-means cannot determine the number of clusters, but you can test which result fits the data set best, if you run it with multiple values of k.
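One common way to do that comparison is to run k-means for a range of k and look at the within-cluster sum of squared errors (SSE). Below is a small sketch on 1-D synthetic data, in plain Python rather than Mahout. The helper names `kmeans_sse` and `best_sse`, the number of restarts, and the data are all assumptions of mine. Since a larger k almost always gives a lower SSE, you don't pick the minimum. Instead you look for the point where adding more clusters stops helping much (the "elbow").

```python
import random

def kmeans_sse(values, k, seed, max_iter=100):
    """Run 1-D k-means once and return the within-cluster sum of squared errors."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # random initial centers taken from the data
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda j: abs(v - centers[j]))].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sum(min((v - c) ** 2 for c in centers) for v in values)

def best_sse(values, k, restarts=10):
    """Keep the best of several random restarts, since a single run can get stuck."""
    return min(kmeans_sse(values, k, seed) for seed in range(restarts))

# Synthetic data with three well-separated groups (an assumption for this demo).
data_rng = random.Random(42)
values = [data_rng.gauss(mu, 0.5) for mu in (0.0, 10.0, 20.0) for _ in range(40)]

for k in range(1, 7):
    print(k, round(best_sse(values, k), 1))
```

On data like this you would expect a large drop in SSE up to the true number of groups and only small improvements afterwards. On real data the elbow is often much less clear-cut.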
k-means doesn't look at density. You need a density-based clustering algorithm for that. k-means prefers similar-sized clusters. Your "k" is probably too high.
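For comparison, DBSCAN is one well-known density-based algorithm (my choice of example, not something from the Mahout example you linked). Here is a minimal, unoptimized sketch in plain Python. The parameter names `eps` and `min_pts` and the toy points are assumptions for illustration. Note that you never pass in a number of clusters: clusters grow from points that have enough neighbors nearby, and isolated points are labeled as noise.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not dense enough; may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense itself
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:  # j is a core point, so expand through it
                queue.extend(more)
        cluster += 1
    return labels

# Toy data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (50.0, 50.0)]

print(dbscan(points, eps=1.5, min_pts=3))
# → [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The neighbor search here is quadratic in the number of points, so for a large dataset you would want an implementation backed by a spatial index.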
Since they are iteratively updated, the different iterations shouldn't overlap much.