machine-learning scikit-learn cluster-analysis k-means

Changing label names of KMeans clusters


I am doing k-means clustering with sklearn in Python. I am wondering how to change the generated label names for the k-means clusters. For example:

data          Cluster
0.2344         1
1.4537         2
2.4428         2
5.7757         3

And I want to get:

data          Cluster
0.2344         black
1.4537         red
2.4428         red
5.7757         blue

I don't mean mapping 1 -> black, 2 -> red only when printing. I am wondering whether it is possible to set custom cluster names in the KMeans model itself, by default.


Solution

  • No
    There isn't any way to change the default labels; KMeans always returns integer labels from 0 to n_clusters - 1.
    You have to map them separately using a dictionary. You can take a look at all available methods in the scikit-learn KMeans documentation.
    None of the available methods or attributes allows you to change the default labels.

    Solution using dictionary:

    # Code
    a = [0,0,1,1,2,2]
    mapping = {0:'black', 1:'red', 2:'blue'}
    a = [mapping[i] for i in a]
    
    # Output
    ['black', 'black', 'red', 'red', 'blue', 'blue']
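
The same dictionary approach can be applied to the labels of a fitted model. The following is a minimal sketch, assuming the four values from the question as one-dimensional input; the names `data`, `mapping` and `named_labels` are illustrative, and the mapping itself is hypothetical because the algorithm, not the user, decides which integer ends up on which cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# One-dimensional toy data taken from the question's example.
data = np.array([0.2344, 1.4537, 2.4428, 5.7757]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Hypothetical label-to-name mapping (assumption for illustration only).
mapping = {0: "black", 1: "red", 2: "blue"}

# labels_ holds one integer per sample; translate each one into its name.
named_labels = np.array([mapping[int(label)] for label in kmeans.labels_])

for value, name in zip(data.ravel(), named_labels):
    print(value, name)
```

The model still stores integers internally; only the translated copy in `named_labels` carries the names.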
    

    What happens if you change your data or the number of clusters? First, let's look at a visualization.
    Importing libraries and generating random data:

    from sklearn.cluster import KMeans
    import numpy as np
    import matplotlib.pyplot as plt
    
    x = np.random.uniform(0, 100, size=(10, 2))  # 10 random 2-D points in [0, 100)
    

    Applying the KMeans algorithm:

    kmeans = KMeans(n_clusters=3, random_state=0).fit(x)
    

    Getting the cluster centers:

    arr = kmeans.cluster_centers_
    

    Your cluster centroids look like this:

    array([[23.81072765, 77.21281171],
           [ 8.6140551 , 23.15597377],
           [93.37177176, 32.21581703]])
    

    Here, the first row is the centroid of cluster 0, the second row is the centroid of cluster 1, and so on.
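
One way to check this row-to-label correspondence is to ask the fitted model which cluster each centroid belongs to. This is only a sketch with freshly generated random data (so the centroid values will differ from those shown above); `rng` and `centroid_labels` are names introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed setup: 10 random 2-D points, seeded for reproducibility.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=(10, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

# Each centroid is closest to itself, so predicting on the centroids
# returns the row index of each one, which is its integer label.
centroid_labels = kmeans.predict(kmeans.cluster_centers_)

for label, centroid in zip(centroid_labels, kmeans.cluster_centers_):
    members = int(np.sum(kmeans.labels_ == label))
    print(label, centroid, "members:", members)
```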

    Visualizing centroids and data:

    plt.scatter(x[:,0],x[:,1])
    plt.scatter(arr[:,0], arr[:,1])
    

    This produces a scatter plot of the data points with the three centroids drawn on top.

    As you can see, you have access to the centroids as well as the training data. If your training data, the number of clusters, and random_state stay the same, the centroids (and therefore which integer label belongs to which cluster) do not change.

    But if you add more training data or change the number of clusters, you will have to create a new mapping based on the centroids that are generated.
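
Building the mapping from the centroids can be automated. The sketch below assumes that ordering the clusters by the first coordinate of their centroid is a meaningful naming rule for your data; the helper `name_clusters_by_centroid` is hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans


def name_clusters_by_centroid(kmeans, names):
    """Map integer labels to names, ordering clusters by the first centroid coordinate.

    Hypothetical helper: the cluster with the smallest centroid gets names[0],
    the next one names[1], and so on.
    """
    order = np.argsort(kmeans.cluster_centers_[:, 0])
    return {int(label): name for label, name in zip(order, names)}


# Toy data from the question, as a single feature column.
data = np.array([0.2344, 1.4537, 2.4428, 5.7757]).reshape(-1, 1)
names = ["black", "red", "blue"]

results = {}
for seed in (0, 1):
    km = KMeans(n_clusters=3, n_init=10, random_state=seed).fit(data)
    mapping = name_clusters_by_centroid(km, names)
    results[seed] = [mapping[int(label)] for label in km.labels_]
    print(seed, results[seed])
```

Because the names are derived from the centroid positions rather than from the raw integers, the smallest value is always named black and the largest blue here, whichever integer labels a particular fit happens to assign.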