
Clustering classifier and clustering policy


I was going through the K-means implementation in Mahout and, while debugging, noticed that it runs the following code when creating the initial clusters:

ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath); 

I read the descriptions of these classes, but they were not clear to me...

What do the cluster classifier and the clustering policy mean? Are they related to hierarchical clustering, centroid-based clustering, distribution-based clustering, and so on?

I do not see the benefit of, or the reason for, using a cluster classifier and a policy in Mahout's K-means implementation.


Solution

  • The implementation shares code with other variants of k-means and similar algorithms such as Canopy pre-clustering and GMM.

    These classes encode only the differences between those algorithms.

    Mahout is not a good place to study the k-means algorithm: the implementation is quite a mess. It is also slow, as in really, really slow. Most of the time, a single-CPU implementation will outright beat Mahout on anything that fits into memory, and maybe even on data that fits on the disk of a single machine, because of all the map-reduce overhead.
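
To make the "shared code plus a small policy" idea concrete, here is a minimal sketch in plain Java. It is not Mahout's code or API: `AssignmentPolicy`, `HardAssignment`, `SoftAssignment`, `SimpleClassifier` and `PolicyDemo` are hypothetical names invented for illustration, and the soft weighting is just a simple inverse-distance rule, not the formula of any particular algorithm. The point is only that the classifier holds the clusters and does the common work, while the policy decides how a point is assigned.

```java
import java.util.Arrays;

// Hypothetical policy interface: the only part that differs between algorithms.
interface AssignmentPolicy {
    // Turn the distances to each center into membership weights.
    double[] weights(double[] distances);
}

// k-means style: all weight goes to the nearest center (hard assignment).
class HardAssignment implements AssignmentPolicy {
    @Override
    public double[] weights(double[] distances) {
        int best = 0;
        for (int i = 1; i < distances.length; i++) {
            if (distances[i] < distances[best]) {
                best = i;
            }
        }
        double[] w = new double[distances.length];
        w[best] = 1.0;
        return w;
    }
}

// Soft style (illustrative only): weights proportional to 1/distance, normalized to sum to 1.
class SoftAssignment implements AssignmentPolicy {
    @Override
    public double[] weights(double[] distances) {
        double[] w = new double[distances.length];
        double sum = 0.0;
        for (int i = 0; i < distances.length; i++) {
            w[i] = 1.0 / (distances[i] + 1e-9); // small constant avoids division by zero
            sum += w[i];
        }
        for (int i = 0; i < w.length; i++) {
            w[i] /= sum;
        }
        return w;
    }
}

// Shared code: stores the centers, computes distances, and delegates to the policy.
class SimpleClassifier {
    private final double[][] centers;
    private final AssignmentPolicy policy;

    SimpleClassifier(double[][] centers, AssignmentPolicy policy) {
        this.centers = centers;
        this.policy = policy;
    }

    double[] classify(double[] point) {
        double[] distances = new double[centers.length];
        for (int i = 0; i < centers.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centers[i][j];
                sum += diff * diff;
            }
            distances[i] = Math.sqrt(sum); // Euclidean distance to center i
        }
        return policy.weights(distances);
    }
}

class PolicyDemo {
    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 0.0}};
        double[] point = {2.0, 0.0};
        // Same classifier code, two different policies.
        System.out.println(Arrays.toString(
                new SimpleClassifier(centers, new HardAssignment()).classify(point)));
        System.out.println(Arrays.toString(
                new SimpleClassifier(centers, new SoftAssignment()).classify(point)));
    }
}
```

With the hard policy the point belongs entirely to the nearest center; with the soft policy it gets a larger weight for the nearer center and a smaller one for the farther center. Swapping the policy changes the algorithm's behaviour without touching the classifier, which is presumably the reason the Mahout code in the question constructs a policy and passes it to the classifier.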