
Retrieve Indices while performing K-Means algorithm


I have a DataFrame of the following form:

import pandas as pd
dict_new = {'var1': [1, 0, 1, 0, 2], 'var2': [1, 1, 0, 2, 0], 'var3': [1, 1, 1, 2, 1]}
dfnew = pd.DataFrame(dict_new, index=['word1', 'word2', 'word3', 'word4', 'word5'])

Please note that the actual dataset is much larger; the example above is simplified. I then ran the K-means algorithm in scikit-learn, using 2 cluster centroids to keep things simple.

from sklearn.cluster import KMeans
num_clusters = 2
km = KMeans(n_clusters=num_clusters,verbose=1)
km.fit(dfnew.to_numpy())

Suppose the new cluster centroids are given by

centers=km.cluster_centers_
centers
array([[0.        , 1.5       , 1.5       ],
       [1.33333333, 0.33333333, 1.        ]])

The goal is to find the two closest words for each cluster centroid. I used distance_matrix from the scipy package and got a 2 x 5 matrix as output, corresponding to 2 centers and 5 words. Please see the code below.

from scipy.spatial import distance_matrix
distance_matrix(centers, dfnew.to_numpy())
array([[1.22474487, 0.70710678, 1.87082869, 0.70710678, 2.54950976],
       [0.74535599, 1.49071198, 0.47140452, 2.3570226 , 0.74535599]])

But the word indices do not appear here, so I cannot identify the two closest words for each centroid. How can I retrieve the index labels that were defined in the original DataFrame? Any help is appreciated.


Solution

  • Assuming I have understood correctly what you want to do, here is a minimal working example showing how to find the index labels of the words.

    First, let's set up a similar reproducible environment:

    # import packages
    import pandas as pd
    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.spatial import distance_matrix
    
    # set up the DataFrame
    dict_new={'var1':[1,0,1,0,2],'var2':[1,1,0,2,0],'var3':[1,1,1,2,1]}
    df = pd.DataFrame(dict_new,index= ['word1','word2','word3','word4','word5'])
    
    # get the cluster centers
    kmeans = KMeans(n_clusters=2, random_state=0).fit(np.array(df))
    centers = kmeans.cluster_centers_
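
    As a side note, the same "reattach the index" idea also works for cluster membership. The sketch below assumes the toy data above; `n_init=10` is my own choice to make the run less sensitive to initialization, and the numeric cluster ids (0 or 1) are arbitrary and may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Same toy data as above
df = pd.DataFrame(
    {"var1": [1, 0, 1, 0, 2], "var2": [1, 1, 0, 2, 0], "var3": [1, 1, 1, 2, 1]},
    index=["word1", "word2", "word3", "word4", "word5"],
)

# n_init=10 is an assumption of this sketch, not part of the original answer
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df.to_numpy())

# labels_ follows the row order of the input array,
# so the DataFrame index can be attached directly
labels = pd.Series(kmeans.labels_, index=df.index, name="cluster")
print(labels)
```

    This works because `to_numpy()` preserves row order, so position i in `labels_` always corresponds to `df.index[i]`.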
    

    If you only need to know the one closest word

    Using a distance matrix, you could do:

    def closest(df, centers):
        # distance matrix: one row per centroid, one column per word
        mat = distance_matrix(centers, df.to_numpy())
        # map each row's argmin position back to its index label
        closest_words = [df.index[i] for i in np.argmin(mat, axis=1)]
    
        return closest_words
    
    # example of it working for all centroids
    print(closest(df, centers))
    # > ['word3', 'word2']
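
    If you would rather keep the labels attached to the distances themselves, one option is to wrap the matrix in a DataFrame whose columns are the words. This is only a sketch: the centroids are hard-coded from the output shown in the question (an assumption, so the example does not depend on a particular KMeans run), which means the centroid order can differ from the run above.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# Same toy data as above
df = pd.DataFrame(
    {"var1": [1, 0, 1, 0, 2], "var2": [1, 1, 0, 2, 0], "var3": [1, 1, 1, 2, 1]},
    index=["word1", "word2", "word3", "word4", "word5"],
)

# Centroids copied from the question's output (hard-coded assumption)
centers = np.array([[0.0, 1.5, 1.5], [4 / 3, 1 / 3, 1.0]])

# Rows are centroids; columns carry the word labels
dist = pd.DataFrame(distance_matrix(centers, df.to_numpy()), columns=df.index)

# idxmin returns the column label of the smallest value in each row
closest_word = dist.idxmin(axis=1)
print(closest_word.tolist())
# ['word2', 'word3'] for these hard-coded centroids
# (word2 and word4 are tied for the first centroid; idxmin keeps the first)
```

    Printing `dist` itself gives the 2 x 5 matrix from the question, but with the words as column headers.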
    

    If you need to know the 2 closest words

    Now, if we want the two closest words:

    def two_closest(df, centers):
        # distance matrix: one row per centroid, one column per word
        mat = distance_matrix(centers, df.to_numpy())
        # sort each row by distance, keep the first two positions, and map them to index labels
        closest_two_words = [[df.index[i] for i in l] for l in np.argsort(mat, axis=1)[:, :2]]
    
        return closest_two_words
    
    # example of it working for all centroids
    print(two_closest(df, centers))
    # > [['word3', 'word5'], ['word2', 'word4']]
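
    The same pattern extends to any number of neighbours. The sketch below uses a hypothetical helper, `k_closest` (my own name, not a library function), that also returns the distances. The centroids are again hard-coded from the question's output, which is an assumption. Note that word1 and word5 are equidistant from one centroid up to floating-point rounding (both show 0.74535599 in the question's matrix), so which of the two is reported second can vary; `kind="stable"` only guarantees that exact ties keep their DataFrame order.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix


def k_closest(df, centers, k=2):
    """For each centroid, return the k nearest index labels with their distances."""
    mat = distance_matrix(centers, df.to_numpy())
    # A stable sort keeps exactly tied rows in their original DataFrame order
    order = np.argsort(mat, axis=1, kind="stable")[:, :k]
    return [
        [(df.index[i], float(mat[row, i])) for i in cols]
        for row, cols in enumerate(order)
    ]


# Same toy data as above
df = pd.DataFrame(
    {"var1": [1, 0, 1, 0, 2], "var2": [1, 1, 0, 2, 0], "var3": [1, 1, 1, 2, 1]},
    index=["word1", "word2", "word3", "word4", "word5"],
)

# Centroids copied from the question's output (hard-coded assumption)
centers = np.array([[0.0, 1.5, 1.5], [4 / 3, 1 / 3, 1.0]])

for centroid_id, neighbours in enumerate(k_closest(df, centers, k=2)):
    print(centroid_id, neighbours)
```

    Returning the distance next to each label makes ties easy to spot, which helps when deciding whether the second-closest word is meaningful.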
    

    Please tell me if this is not what you wanted or if my answer does not fit your needs. And don't forget to mark the question as answered if this solved your problem.