Tags: python, cluster-analysis, k-means, frequency, tf-idf

Python KMeans: print absolute frequency of words in each cluster


Hello, is there a way to print the absolute frequency of each word in a cluster? My code looks like this:

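from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# documents: the list of text strings to cluster (originally named `list`,
# which shadows the Python builtin)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

true_k = 4

model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

model.fit(X)

print("Top terms per cluster:")

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :5]:
        print(' %s' % terms[ind])
    print()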

My results are e.g.:

Top terms per cluster:
Cluster 0:
 house
 roof
 table
 chair
 tv

Cluster 1:
 ...

But I want something like this, with absolute frequencies of each word:

Cluster 0:
 house 65
 roof 45
 table 44
 chair 33
 tv 18

Thank you in advance :)


Solution

  • I'm not sure why TfidfVectorizer is needed on individual words, but in any case you can use KMeans to predict a cluster label for each word. Then check the word frequencies within a cluster with df[df.cluster == some_label].words.value_counts(), where some_label is the cluster number you are interested in.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    
    words = ['this','is','a','very','long','text','my','name','is','not','cortana','today','I','will',
    'write','a','long','text','I','am','from','planet','earth','this','text','does','not','make',
     'sense']
    
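    #tfidf (note: the default token_pattern ignores one-letter tokens such as
    #'a' and 'I', so those entries become all-zero vectors)
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(words)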
    
    #kmeans
    true_k = 4
    model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)
    lab = model.predict(X)
    
    #save cluster labels for each sample in a dataframe 
    df = pd.DataFrame({'words':words, 'cluster':lab})
    
    #check word freq for cluster==1 (print so it also shows when run as a script)
    print(df[df.cluster==1].words.value_counts())