cluster-analysis, data-mining, mahout, k-means

K Means Clustering using Mahout


I'm using the clustering technique given here, which comes from the Mahout examples, to cluster a large dataset. However, when I visualize the resulting clustering, I get the following figure.

Mahout k-means visualization.

I'm really struggling to understand what this actually means and have several questions.

  1. What do all the coloured lines indicate?
  2. What does having so many clusters mean?
  3. Why are a few areas crowded while the others aren't?
  4. Why do a few of the coloured lines overlap each other?

Solution

  • k-means is not the most advanced clustering technique. Circles are a misleading way to visualize it: k-means actually partitions the data space into Voronoi cells (look it up on Wikipedia). It also prefers similar-sized clusters.

    1. I assume that the different colors indicate the different iterations of k-means. It requires several iterations to optimize its result (which usually reaches only a local minimum, and different runs will produce different results). So I guess the results aren't very stable yet. They shift only slowly, which is why they don't overlap much.

    2. The number of clusters is a parameter for k-means. It's commonly denoted as k. k-means cannot determine the number of clusters itself, but if you run it with multiple values of k, you can test which result fits the data set best.

    3. k-means doesn't look at density. You need a density-based clustering algorithm for that. k-means prefers similar-sized clusters. Your "k" is probably too high.

    4. Since they are iteratively updated, the different iterations shouldn't overlap much.
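
To make the points above concrete, here is a minimal sketch in plain Python. It is not Mahout's implementation, only an illustration of the same idea. The toy data (three Gaussian blobs), the function names `nearest`, `kmeans` and `sse`, and the choice of 10 iterations and 5 seeds are all assumptions made for this example. It records the centers after every iteration (what I assume the colors show), assigns each point to its nearest center (which is what produces the Voronoi cells), and compares the within-cluster sum of squares for several values of k.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd-style k-means; returns the centroids after every iteration."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centers: k random data points
    history = [centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cell of its nearest centroid.
        cells = [[] for _ in range(k)]
        for p in points:
            cells[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cell
        # (an empty cell keeps its previous centroid).
        centroids = [
            tuple(sum(xs) / len(cell) for xs in zip(*cell)) if cell else centroids[i]
            for i, cell in enumerate(cells)
        ]
        history.append(centroids)
    return history


def sse(points, centroids):
    """Within-cluster sum of squared distances to the nearest centroid."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(pt, centroids[nearest(pt, centroids)]))
        for pt in points
    )


# Hypothetical toy data: three Gaussian blobs in 2-D.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
points = [
    (data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
    for cx, cy in blob_centers
    for _ in range(50)
]

# One run with k=3: the centers move from iteration to iteration.
history = kmeans(points, k=3)
sse_per_iteration = [sse(points, c) for c in history]

# Several values of k, keeping the best of a few random starts for each.
sse_by_k = {
    k: min(sse(points, kmeans(points, k, seed=s)[-1]) for s in range(5))
    for k in range(1, 7)
}
for k, value in sse_by_k.items():
    print(k, round(value, 1))
```

Plotting `history` would give one set of centers per iteration, and looking at where `sse_by_k` stops dropping sharply is one common heuristic (not the only one) for picking k. Note that nothing in this sketch looks at density.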