Tags: python-3.x, scikit-learn, k-means

How to find which texts are closest to the center of each KMeans cluster


I have a list of texts. I have already computed TF-IDF vectors for them and run KMeans clustering on the result. How do I find which texts are closest to the center of each cluster?

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['this is text one', 'this is text two', 'this is text three',
        'thats are next', 'that are four', 'that are three',
        'lionel messi is footbal player', 'kobe bryant is basket ball player',
        'rossi is motogp racer']

# Build the TF-IDF matrix (one row per text)
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(text)
cluster_text = Tfidf_vect.transform(text)

# Cluster the TF-IDF vectors into 3 groups
kmeans = KMeans(n_clusters=3, random_state=0, max_iter=600, n_init=10)
kmeans.fit(cluster_text)
labels = kmeans.labels_
center = kmeans.cluster_centers_

Expected output:

closest text to the center of cluster 1 = ['this is text two', 'this is text three']
closest text to the center of cluster 2 = ['that are three', 'that are four']
closest text to the center of cluster 3 = ['rossi is motogp racer']

Thank you for your help


Solution

  • You can rank the texts by the cosine distance between each text's TF-IDF vector and each cluster center; the texts with the smallest distances are the closest ones. Try this:

    import numpy as np
    from sklearn.metrics import pairwise_distances
    
    # Cosine distance from every text (rows) to every cluster center (columns)
    distances = pairwise_distances(cluster_text, kmeans.cluster_centers_, 
                                   metric='cosine')
    
    # np.argsort returns indices, not ranks: column i holds the text indices
    # ordered from nearest to farthest from the center of cluster i
    ranking = np.argsort(distances, axis=0)
    
    top_n = 2
    
    for i in range(kmeans.n_clusters):
        print('top_{} closest text to the cluster {} :'.format(top_n, i))
        # The first top_n entries of column i index straight into the text list
        print([text[j] for j in ranking[:top_n, i]])
    
    This prints the top_n texts nearest to each cluster center. The ranking covers all texts, not only those assigned to the cluster.
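
If you only want texts that actually belong to a cluster, as in the expected output in the question, one option is to filter by label before sorting. The sketch below is a minimal illustration, not the only way to do it: `closest_members` is a hypothetical helper name, and the toy texts, labels and distances are made up so the block runs with NumPy alone. In the setup from the question you would presumably pass `kmeans.labels_` as the labels, and either the cosine distance matrix from `pairwise_distances` or `kmeans.transform(cluster_text)` (which returns Euclidean distances to the centers) as the distances.

```python
import numpy as np


def closest_members(distances, labels, texts, top_n=2):
    """Return {cluster index: texts nearest to that cluster's center}.

    Hypothetical helper: only texts assigned to a cluster are candidates for it.
    distances has shape (n_texts, n_clusters); labels has shape (n_texts,).
    """
    distances = np.asarray(distances)
    labels = np.asarray(labels)
    result = {}
    for c in range(distances.shape[1]):
        members = np.flatnonzero(labels == c)      # row indices assigned to cluster c
        order = np.argsort(distances[members, c])  # positions within members, nearest first
        result[c] = [texts[j] for j in members[order[:top_n]]]
    return result


# Toy stand-ins, made up for illustration: 5 texts, 2 clusters.
toy_texts = ['a', 'b', 'c', 'd', 'e']
toy_labels = [0, 0, 1, 1, 0]
toy_distances = [[0.3, 0.9],
                 [0.1, 0.8],
                 [0.7, 0.2],
                 [0.6, 0.4],
                 [0.5, 0.9]]

print(closest_members(toy_distances, toy_labels, toy_texts))
# {0: ['b', 'a'], 1: ['c', 'd']}
```

A cluster with fewer than `top_n` members simply returns all of its members, so small clusters do not raise an error.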