python scikit-learn k-means

Comparing k-means clustering results between scikit-learn versions 1.2.2 and 1.3.1


I have run into an issue with k-means clustering in scikit-learn (Python): with the number of clusters (k) set to 3, the same code produces different clustering results under versions 1.2.2 and 1.3.1.

Here’s a snippet of code (using a toy dataset):

from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
import numpy as np

digits = load_digits()
X = digits.data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Per-sample silhouette values and the overall (mean) silhouette score
sample_silhouette_values = silhouette_samples(X_scaled, cluster_labels)
sil_score = silhouette_score(X_scaled, cluster_labels)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10  # vertical offset of the first cluster band

for i in range(kmeans.n_clusters):
    # Sorted silhouette values of the samples assigned to cluster i
    ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, alpha=0.7)
    # Label the band with its cluster index
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10  # leave a gap before the next band

ax.set_xlabel("The silhouette coefficient values")
# Dashed red line marks the mean silhouette score over all samples
ax.axvline(x=sil_score, color="red", linestyle="--")
# Use the fitted number of clusters, not the leftover loop variable i
ax.set_title(f"Silhouette plot for KMeans clustering with n_clusters = {kmeans.n_clusters}")
plt.show()

Discrepancies Observed:

The discrepancies appear as soon as I call the fit and fit_predict methods. The scikit-learn changelog does list changes between these two versions, but I'm unsure whether those changes are the reason for the discrepancies in my clustering results.

Queries:

  • Could the version differences between scikit-learn 1.2.2 and 1.3.1 cause variations in the k-means clustering outcomes? What is the reason for the difference?

  • Which of these clustering results should be considered correct?

Thank you for your assistance!


Solution

  • Is the change between versions significantly larger than the change you get from varying the random_state value within a single version? For instance, compute the Silhouette coefficient of the clustering for random_state values from 0 to 99, then take the mean and standard deviation.

    If the change across versions is of the same order as the standard deviation measured when varying the random_state seed, you can consider both versions equally correct and ignore small variations of the Silhouette coefficient.

    You can also increase n_init, for instance to n_init=10 or even n_init=30, to get more stable (and higher-quality) results at the cost of longer training times.

    Finally, you might want to have a look at https://github.com/gittar/breathing-k-means as a more stable alternative to the traditional KMeans algorithm implemented by default in scikit-learn.
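
The seed-variation check suggested above could be sketched roughly as follows. This is a minimal sketch, not a definitive procedure: the helper name silhouette_spread is hypothetical (it is not part of scikit-learn), the data preparation simply mirrors the question's code, and only 10 seeds are used here to keep the run short. Pass seeds=range(100) for the full check described in the answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_spread(X, n_clusters=3, seeds=range(100), n_init=10):
    """Hypothetical helper: mean and std of the silhouette score over seeds."""
    scores = []
    for seed in seeds:
        # Refit from scratch for every seed; n_init is set explicitly so the
        # result does not depend on the library's default for that parameter.
        labels = KMeans(
            n_clusters=n_clusters, n_init=n_init, random_state=seed
        ).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    scores = np.asarray(scores)
    return scores.mean(), scores.std()


# Same preprocessing as in the question: standardized digits data.
X_scaled = StandardScaler().fit_transform(load_digits().data)

# Compare the seed-to-seed spread for a single init and for several inits.
for n_init in (1, 10):
    mean, std = silhouette_spread(X_scaled, seeds=range(10), n_init=n_init)
    print(f"n_init={n_init:>2}: silhouette = {mean:.4f} +/- {std:.4f}")
```

Running this under each scikit-learn version and comparing the gap between the two means with the printed standard deviation gives a rough idea of whether the version difference stands out from ordinary seed-to-seed noise.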