python, scikit-learn, k-means

KMeans clustering results change on every training run


I'm using the sklearn KMeans algorithm to group multiple observations into 4 clusters, and I have set random_state and a NumPy seed so that I always get the same results. However, each time I reload the code in Google Colab and rerun the training, I get different results in terms of the number of observations in each cluster. Here is the code:

 import numpy as np
 np.random.seed(5)
 from sklearn.cluster import KMeans
 kmeans = KMeans(n_clusters=4,init='k-means++',n_init=1,max_iter=3000,random_state=354)
 kmeans.fit(X)
 y_kmeans = kmeans.predict(X)

How can I always obtain the same results (in terms of the number of observations in each cluster)?

Thank you in advance


Solution

  • Here is the relevant passage from the documentation:

    If the algorithm stops before fully converging (because of ``tol`` or
    ``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
    i.e. the ``cluster_centers_`` will not be the means of the points in each
    cluster. Also, the estimator will reassign ``labels_`` after the last
    iteration to make ``labels_`` consistent with ``predict`` on the training
    set.
    

    To get a good handle on max_iter, see the k_means function in sklearn.cluster. Setting return_n_iter to True makes it also return best_n_iter, the number of iterations that corresponds to the best result.

    Here's an example:

    from sklearn.cluster import k_means
    centroids, labels, inertia, best_n_iter = k_means(X, n_clusters=2, init='k-means++', random_state=0, return_n_iter=True)
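
To check both points raised here, a minimal sketch like the following may help. It fits KMeans twice with the question's settings and compares the cluster sizes, then checks whether the run stopped before reaching max_iter. The data is a synthetic stand-in generated with make_blobs (an assumption, since the original X is not shown), and fit_and_count is a hypothetical helper name used only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the asker's X (assumption: the real data is not shown).
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

def fit_and_count(X, seed=354, max_iter=3000):
    """Hypothetical helper: fit KMeans with a fixed seed,
    return the size of each cluster and the number of iterations run."""
    kmeans = KMeans(n_clusters=4, init="k-means++", n_init=1,
                    max_iter=max_iter, random_state=seed)
    labels = kmeans.fit_predict(X)
    sizes = np.bincount(labels, minlength=4)
    return sizes, kmeans.n_iter_

sizes_a, n_iter_a = fit_and_count(X)
sizes_b, n_iter_b = fit_and_count(X)

print("cluster sizes, run 1:", sizes_a)
print("cluster sizes, run 2:", sizes_b)
print("identical sizes:", np.array_equal(sizes_a, sizes_b))
# If n_iter_ is below max_iter, the run stopped on the tolerance, not the cap.
print("stopped before max_iter:", n_iter_a < 3000)
```

If the sizes match on identical input like this but still differ between Colab sessions, one possible explanation is that X itself changes between runs (for example a different row order or an unseeded shuffle or sample upstream), so it may be worth verifying that the array passed to fit is the same each time.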