python apache-spark machine-learning scikit-learn cluster-analysis

Different silhouette scores when using sklearn and Spark

When I compute the silhouette score with the same data and same prediction values while using Spark and sklearn I get different results.

Here is the code using for Spark:

>>> prediction.show()
+---+---+---------+----------+
|  a|  b| features|prediction|
+---+---+---------+----------+
|  1|  1|[1.0,1.0]|         1|
|  2|  2|[2.0,2.0]|         1|
|  3|  3|[3.0,3.0]|         0|
|  4|  4|[4.0,4.0]|         0|
+---+---+---------+----------+

>>> from pyspark.ml.evaluation import ClusteringEvaluator
>>> evaluator = ClusteringEvaluator()
>>> silhouette = evaluator.evaluate(prediction)
>>> silhouette
0.7230769230769223

Here is the code using for sklearn:

>>> from sklearn.cluster import KMeans
>>> from sklearn import metrics
>>> x=[[1,1],[2,2],[3,3],[4,4]]
>>> prediction = KMeans(n_clusters=2,max_iter=1000,random_state=123).fit_predict(x)
>>> prediction
array([1, 1, 0, 0], dtype=int32)
>>> silhouette = metrics.silhouette_score(x, prediction)
>>> silhouette
0.46666666666666673

As shown above, the scores are very different when though the inputs are the same. Why is that?

Solution

The main difference is that different distance measures are used.

Spark uses an the squared euclidean as the distance measure as compared to sklearn that uses normal euclidean distance by default.

The reason for this choice of distance measure in Spark is to allow for more a efficient and parallel computation. Parts of the equation can be precomputed, reducing the computational complexity from O(N^2^*D), where N is the number of points and D their dimension, to O(C*D*N/W) where W is the number of workers and C the number of clusters (assumed to be quite low). The Spark math derivation and implementation of the silhouette score is documented on github (here).

Proof:

We can analyse the example in the question and calculate the silhouette scores by hand using both euclidean distance and squared euclidean distance.

We have cluster 1 with points (1,1), (2,2) and a cluster center at (1.5,1.5) and cluster 2 with (3,3), (4,4) and a cluster center at (3.5,3.5).

The final silhouette score is the mean of the silhouette scores of all samples. Since the four points in the question are perfectly mirrored and there are only two clusters, it's enough to calculate the score for one of these clusters (here I selected cluster 1).

Below a is the mean intra-cluster distance (mean distance to all point in the same cluster) and b the mean inter-cluster distance (mean distance to all points in the closest cluster that the point does not belong to). The score is calculated as (b-a) / max(b,a).

Euclidean distance:

Point (1,1):
- a = sqrt((2-1)^2 + (2-1)^2) = sqrt(2)
- b = (sqrt((3-1)^2 + (3-1)^2) + sqrt((4-1)^2 + (4-1)^2))) / 2 = (sqrt(8) + sqrt(18)) / 2 = 3.5355
- point score = (3.5355 - sqrt(2)) / 3.5355 = 0.6
Point (2,2):
- a = sqrt((2-1)^2 + (2-1)^2) = sqrt(2)
- b = (sqrt((3-2)^2 + (3-2)^2) + sqrt((4-2)^2 + (4-2)^2))) / 2 = (sqrt(2) + sqrt(8)) / 2 = 2.1213
- point score = (2.1213 - sqrt(2)) / 2.1213 = 0.33333

Silhouette score = (0.6 + 0.33333) / 2 = 0.4666667

Squared euclidean distance:

Point (1,1):
- a = (2-1)^2 + (2-1)^2 = 2
- b = ((3-1)^2 + (3-1)^2 + (4-1)^2 + (4-1)^2) / 2 = (8 + 18) / 2 = 13
- point score = (13 - 2) / 13 = 0.84615
Point (2,2):
- a = (2-1)^2 + (2-1)^2 = 2
- b = ((3-2)^2 + (3-2)^2 + (4-2)^2 + (4-2)^2) / 2 = (2 + 8) / 2 = 5
- point score = (5 - 2) / 5 = 0.6

Silhouette score = (0.84615 + 0.6) / 2 = 0.723075