python sorting cluster-analysis k-means

Python kmeans: Sort cluster labels based on the value of their centroid


I have this simple kmeans function that I apply to a list of lists of floats:

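import numpy as np
from sklearn.cluster import KMeans

def clustering(k, lists_to_cluster):
    # Cluster the sublists on a single feature: each sublist's maximum value.
    max_vals = [max(sublist) for sublist in lists_to_cluster]
    kmeans_ampl = KMeans(k, random_state=123).fit(np.array(max_vals).reshape(-1, 1))
    # labels_ holds the cluster index of each sample, not the centroids.
    labels_ampl = kmeans_ampl.labels_
    return labels_ampl

centroids_labels = clustering(3, lists_to_cluster)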

centroids_labels returns [0,0,1,2,2,0], but the lists with the highest max_vals are labeled 0. I'd like the cluster labels to be sorted in ascending order of max_vals (label 0 assigned to the lists with the lowest max_vals, and so on up to label k-1 for the lists with the highest max_vals). Is there a way to do this before or while applying kmeans, or should I just sort and remap the labels afterwards? Thanks!


Solution

  • You can group the max_vals by cluster into a dictionary that maps each cluster label to the list of max_vals assigned to it.

    Then sort the dictionary values (the lists) by their minimum, their maximum, or any other statistic, and number the sorted groups from 0 upwards.

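    def relabel(labels, vals):
        # Group the values by their original cluster label.
        d = {}
        for k, v in zip(labels, vals):
            d.setdefault(k, []).append(v)
        # Sort the groups by their smallest member and number them in that order.
        return list(enumerate(sorted(d.values(), key=min))) # or key=max, or key=statistics.mean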
    
    lists_to_cluster = [[1], [2], [3], [6], [7], [8], [101], [102], [103]]
    max_vals = [max(sublist) for sublist in lists_to_cluster]
    centroids_labels = clustering(3,lists_to_cluster)
    print( relabel(centroids_labels, max_vals) )
    # [(0, [1, 2, 3]), (1, [6, 7, 8]), (2, [101, 102, 103])]
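If you need the remapped label of each sample rather than the grouped values, one option is to rank the centroids and translate every old label to its rank. The following is only a minimal sketch: the function name `sort_labels_by_centroid` and the centroid values in the usage lines are made up for illustration. With scikit-learn, the centroids for a one-dimensional feature could be taken from `kmeans_ampl.cluster_centers_.ravel()`.

```python
def sort_labels_by_centroid(labels, centroids):
    """Remap labels so that label 0 goes to the cluster with the smallest centroid.

    labels    -- one cluster index per sample (e.g. the values of labels_)
    centroids -- one scalar centroid per cluster, indexed by the old label
    """
    # order[i] is the old label of the cluster with the i-th smallest centroid.
    order = sorted(range(len(centroids)), key=lambda old: centroids[old])
    # Invert that ordering: old label -> new (rank-based) label.
    new_label = {old: new for new, old in enumerate(order)}
    return [new_label[old] for old in labels]


# Hypothetical centroids: old cluster 0 has the largest one, old cluster 1 the smallest.
old_labels = [0, 0, 1, 2, 2, 0]
hypothetical_centroids = [9.0, 1.5, 4.0]
print(sort_labels_by_centroid(old_labels, hypothetical_centroids))
# [2, 2, 0, 1, 1, 2]
```

This remaps after fitting; the clustering itself is unchanged, only the label numbers are reassigned.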