machine-learning, apache-spark, cluster-analysis, distributed-computing, k-means

Using the Silhouette Score with Clustering in Spark


I want to use the silhouette to determine the optimal value of k when using KMeans clustering in Spark. Is there a good way to parallelize this, i.e. make it scalable?


Solution

  • No, the silhouette by definition is not scalable.

    It is based on pairwise distances, which always takes O(n^2) time to compute (see the definition sketched at the end of this answer).

    You will need to use something different. Using the silhouette on large data is absurd: it takes far longer to compute the evaluation measure than to run the actual k-means clustering algorithm.

    Or reconsider what you are doing. Does it make sense to use the silhouette at all, for example? You could also decide to run something faster than Spark on single nodes, compute the silhouette there, and simply parallelize over k, without all the overhead of distributed computation (a rough sketch of this approach follows below). Spark may win against MapReduce-Mahout, but it will lose against a good non-distributed implementation.
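
For reference, the silhouette of a single point i is defined as

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
\]

where a(i) is the mean distance from i to all other points in its own cluster and b(i) is the smallest mean distance from i to the points of any other cluster. Evaluating a(i) and b(i) for every point requires essentially all pairwise distances, which is where the O(n^2) cost comes from.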
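
As a rough sketch of the "single node, parallelize over k" suggestion: pull a manageable sample out of Spark, then fit k-means and score the silhouette locally for each candidate k in parallel. The sklearn/joblib calls, the sample size, and the k range below are my own assumptions for illustration, not something Spark provides.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_for_k(X, k):
    # Fit k-means on the (single-node) sample and score the resulting labels.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return k, silhouette_score(X, labels)

# Placeholder data: in practice this would be a sample pulled down from Spark
# (e.g. via df.sample(...).toPandas()), kept small enough for O(n^2) scoring.
X = np.random.rand(10_000, 8)

# Parallelize over the candidate values of k rather than over the data.
scores = Parallel(n_jobs=-1)(
    delayed(silhouette_for_k)(X, k) for k in range(2, 11)
)

best_k, best_score = max(scores, key=lambda t: t[1])
print(best_k, best_score)
```

The point of the design is that each k is an independent, cheap, single-node job, so you get trivial parallelism without paying the distributed-computation overhead for an O(n^2) measure.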