I have run into an issue with k-means clustering in scikit-learn: with the number of clusters (k) set to 3, the clustering results are inconsistent between versions 1.2.2 and 1.3.1.
Here’s a snippet of code (using a toy dataset):
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
import numpy as np
digits = load_digits()
X = digits.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)
sample_silhouette_values = silhouette_samples(X_scaled, cluster_labels)
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10
sil_score = silhouette_score(X_scaled, cluster_labels)
for i in range(3):
    ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, alpha=0.7)
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10
ax.set_xlabel("The silhouette coefficient values")
ax.axvline(x=sil_score, color="red", linestyle="--")
# Figure-level title; the loop variable i would report 2 here, not the cluster count.
fig.suptitle("Silhouette analysis for KMeans clustering with n_clusters = 3")
ax.set_title("Silhouette plot for the various clusters")
plt.show()
Discrepancies Observed:
We’ve noticed discrepancies in our results as early as the output of the fit and fit_predict methods. The scikit-learn changelog does list changes between these two versions, but I'm unsure whether they are the reason for the discrepancies in our clustering results.
Could the version differences between scikit-learn 1.2.2 and 1.3.1 cause variations in the k-means clustering outcomes? What is the reason for the difference?
Which of these clustering results should be considered correct?
Thank you for your assistance!
Is the change between versions significantly larger than the change caused by varying the random_state value within a given version? For instance, try computing the Silhouette coefficient of the clustering for random_state values from 0 to 99, then compute the average and standard deviation.
If the change across versions is of the same order as the standard deviation measured when varying the random_state seed, you can consider both versions equally correct and ignore small variations of the Silhouette coefficient.
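The seed experiment described above could be sketched roughly as follows. The helper name `silhouette_over_seeds` is made up for illustration, and the data preparation simply mirrors the snippet in the question. Running it once in each environment gives a mean and standard deviation to compare against the difference seen between versions.

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_over_seeds(X, seeds, n_clusters=3, n_init=10):
    """Return one silhouette score per random_state value in `seeds`.

    Hypothetical helper, not part of scikit-learn.
    """
    scores = []
    for seed in seeds:
        # n_init is passed explicitly so the result does not depend on
        # whatever default the installed version uses.
        labels = KMeans(
            n_clusters=n_clusters, n_init=n_init, random_state=seed
        ).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    return np.asarray(scores)


# Same preprocessing as in the question: standardized digits data.
X_scaled = StandardScaler().fit_transform(load_digits().data)
scores = silhouette_over_seeds(X_scaled, seeds=range(100))

print(f"scikit-learn {sklearn.__version__}")
print(f"silhouette over 100 seeds: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

If the gap between the two versions' silhouette scores at random_state=42 is comparable to the printed standard deviation, it is hard to distinguish from ordinary seed-to-seed noise.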
You can also increase n_init, for instance to n_init=10 or even n_init=30, to get more stable (and higher-quality) results at the cost of longer training times.
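As a rough illustration of the n_init suggestion, the sketch below fits the same data with several explicit n_init values and reports the inertia and silhouette score of each fit. The values 1, 10 and 30 and the seed 42 are only examples; the actual numbers will depend on the installed version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same preprocessing as in the question: standardized digits data.
X_scaled = StandardScaler().fit_transform(load_digits().data)

results = {}
for n_init in (1, 10, 30):
    kmeans = KMeans(n_clusters=3, n_init=n_init, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    # inertia_ is the within-cluster sum of squared distances for the
    # best of the n_init runs; lower means a tighter clustering.
    results[n_init] = (kmeans.inertia_, silhouette_score(X_scaled, labels))

for n_init, (inertia, sil) in results.items():
    print(f"n_init={n_init:>2}: inertia={inertia:.1f}, silhouette={sil:.4f}")
```

Passing n_init explicitly in both environments also removes one possible source of difference between versions, since the result no longer relies on a library default.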
Finally, you might want to have a look at https://github.com/gittar/breathing-k-means as a more stable alternative to the traditional KMeans algorithm implemented by default in scikit-learn.