python, scikit-learn, k-means

K-means: how to get good group separation


I have some basic code that uses KMeans from sklearn:

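    from sklearn.cluster import KMeans
    from sklearn.preprocessing import MinMaxScaler

    # Rescale every feature to the [0, 1] range
    scaled = MinMaxScaler().fit_transform(points)

    kmeans = KMeans(n_clusters=nb_clusters) # , random_state=42 , init='random', algorithm='elkan'
    kmeans.fit(scaled)
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_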

But I can't get my two groups well separated: [image]

Does anyone know what is wrong or what is missing?

Thank you in advance.


Solution

  • The k-means algorithm assumes spherical clusters of roughly equal diameter, e.g. isotropic Gaussian-distributed clusters. Your data does not fit this assumption.

    First, you should plot your data with a 1:1 aspect ratio to better understand what is going on. You will see that the separation between your two groups circled in blue is small compared to their spread in the y-direction.

    Second, note that the group on the right contains far fewer data points than the group on the left. To the algorithm, the points on the right are mere outliers of the distribution it tries to model.

    To sum up, k-means is probably not a good fit for your data. Mean shift, although maybe overkill, should be a much better fit.
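The points above can be tried out with a rough sketch like the following. It does not use your data: the two groups are synthetic stand-ins, and the group sizes, the spreads, the random seeds, the `bandwidth=1.5` value, and the helper name `plot_equal_aspect` are all assumptions made up for illustration. It compares the gap between the groups with their spread in y, shows where k-means puts its two centroids, and runs scikit-learn's `MeanShift` on the same points. Whether mean shift separates your real groups cleanly will depend on the bandwidth, which usually needs tuning.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift

# Hypothetical stand-in for the data in the question: a large group that is
# narrow in x but tall in y, plus a much smaller group a short way to the right.
rng = np.random.default_rng(0)
left = rng.normal(loc=[0.0, 0.0], scale=[0.3, 2.0], size=(500, 2))
right = rng.normal(loc=[3.0, 0.0], scale=[0.3, 2.0], size=(25, 2))
points = np.vstack([left, right])

# Compare the horizontal gap between the groups with the vertical spread.
gap_x = right[:, 0].mean() - left[:, 0].mean()
spread_y = points[:, 1].std()
print(f"gap in x: {gap_x:.2f}, standard deviation in y: {spread_y:.2f}")

# K-means minimises within-cluster squared distances, so with k=2 it tends to
# cut the data along the long axis (y) and treats the small group as outliers.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
centers = kmeans.cluster_centers_
dx, dy = np.abs(centers[0] - centers[1])
print(f"k-means centroids differ by {dx:.2f} in x and {dy:.2f} in y")

# Mean shift looks for density modes and does not need the number of clusters.
# The bandwidth is a guess for this synthetic data, not a recommended value.
mean_shift = MeanShift(bandwidth=1.5).fit(points)
print("mean shift cluster centers:")
print(mean_shift.cluster_centers_)


def plot_equal_aspect(points, labels):
    """Scatter plot with a 1:1 aspect ratio so distances are not distorted."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=labels, s=5)
    ax.set_aspect("equal")  # one unit in x is drawn as long as one unit in y
    plt.show()
```

Calling `plot_equal_aspect(points, kmeans.labels_)` and then `plot_equal_aspect(points, mean_shift.labels_)` lets you compare the two results visually. Note that this sketch clusters the raw coordinates; if you keep the `MinMaxScaler` step, plot the scaled data with the same 1:1 aspect ratio to see what the algorithm actually sees.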