Tags: python, sorting, numpy, scikit-learn, k-means

How to set k-Means clustering labels from highest to lowest with Python?


I have a dataset of 38 apartments and their electricity consumption in the morning, afternoon and evening. I am trying to cluster this dataset using the k-Means implementation from scikit-learn, and am getting some interesting results.

First clustering results: [image]

This is all very well, and with 4 clusters I obviously get 4 labels assigned to the apartments: 0, 1, 2 and 3. Using the random_state parameter of the KMeans class, I can fix the seed with which the centroids are randomly initialized, so the same apartments consistently get the same labels.

However, since this case concerns energy consumption, the clusters can be ranked measurably from the lowest to the highest consumers. I would therefore like to assign label 0 to the apartments with the lowest consumption level, label 1 to the apartments that consume a bit more, and so on.

As of now, my labels are [2 1 3 0], or ["black", "green", "blue", "red"]; I would like them to be [0 1 2 3], or ["red", "green", "black", "blue"]. How can I do this while still keeping the centroid initialization random (with a fixed seed)?

Thank you very much for the help!


Solution

  • Transforming the labels through a lookup table is a straightforward way to achieve what you want.

    To begin with, I generate some mock data:

    import numpy as np
    
    np.random.seed(1000)
    
    n = 38  # number of apartments
    # One feature per time of day; each row of X is one apartment
    X_morning = np.random.uniform(low=.02, high=.18, size=n)
    X_afternoon = np.random.uniform(low=.05, high=.20, size=n)
    X_night = np.random.uniform(low=.025, high=.175, size=n)
    X = np.vstack([X_morning, X_afternoon, X_night]).T  # shape (n, 3)
    

    Then I perform clustering on the data:

    from sklearn.cluster import KMeans
    k = 4
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
    

    And finally I use NumPy's argsort to create a lookup table like this:

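    # Original cluster labels, sorted from lowest to highest total consumption
    idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
    # lut[old_label] is the new label, i.e. the rank of that cluster
    lut = np.zeros_like(idx)
    lut[idx] = np.arange(k)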
    

    Sample run:

    In [70]: kmeans.cluster_centers_.sum(axis=1)
    Out[70]: array([ 0.3214523 ,  0.40877735,  0.26911353,  0.25234873])
    
    In [71]: idx
    Out[71]: array([3, 2, 0, 1], dtype=int64)
    
    In [72]: lut
    Out[72]: array([2, 3, 1, 0], dtype=int64)
    
    In [73]: kmeans.labels_
    Out[73]: array([1, 3, 1, ..., 0, 1, 0])
    
    In [74]: lut[kmeans.labels_]
    Out[74]: array([3, 0, 3, ..., 2, 3, 2], dtype=int64)
    

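    idx lists the original cluster labels ordered from lowest to highest consumption level (measured here as the sum of each centroid's coordinates), and lut maps each original label to its rank in that order. The apartments for which lut[kmeans.labels_] is 0 belong to the cluster with the lowest consumption level, and those for which it is 3 belong to the cluster with the highest.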
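
The steps above could be wrapped into a single helper, roughly as sketched below. This is only an illustration under a few assumptions: the function name ordered_kmeans, the explicit n_init=10, the mock data drawn with np.random.default_rng, and the palette array are not part of the original answer. The sketch also reorders the centroids so that row r corresponds to the new label r, and maps the ordered labels to the colour order requested in the question.

```python
import numpy as np
from sklearn.cluster import KMeans


def ordered_kmeans(X, k, random_state=0):
    """Hypothetical helper: fit KMeans, then rank clusters by total centroid consumption."""
    # n_init is set explicitly here (an assumption) so the run does not
    # depend on the default of the installed scikit-learn version.
    kmeans = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit(X)
    # Original labels sorted from lowest to highest summed centroid coordinates
    idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
    # lut[old_label] -> new label; lut is the inverse permutation of idx
    lut = np.zeros_like(idx)
    lut[idx] = np.arange(k)
    labels = lut[kmeans.labels_]
    centers = kmeans.cluster_centers_[idx]  # row r is the centroid of new label r
    return labels, centers


# Mock data in the spirit of the answer: 38 apartments, 3 times of day
rng = np.random.default_rng(1000)
X = rng.uniform(low=0.02, high=0.20, size=(38, 3))

labels, centers = ordered_kmeans(X, k=4)
totals = centers.sum(axis=1)  # non-decreasing by construction

# Colours in the order the question asks for (label 0 = lowest consumers)
palette = np.array(["red", "green", "black", "blue"])
point_colors = palette[labels]
```

Because the ranking is applied after fitting, the centroid initialization stays random with a fixed seed; only the label numbering changes. Summing the centroid coordinates is one possible ranking criterion, and a different one (for example the mean, or a single time of day) could be swapped in at the argsort line.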