I have performed hard clustering (using tf-idf weights) on a text corpus and obtained roughly 200 clusters. How can I obtain the topic of each cluster?
I have tried running LDA on the raw text corpus (before clustering) and obtained many topics, but I am unsure how to map these topics onto my existing clusters. Is LDA the right approach, and if so, how do I proceed with it? Or is there another method you would recommend?
The online material I have found only shows how to map LDA topics onto document sentences, not onto pre-existing clusters. If I did that and then grouped the sentences by their assigned topics, I would get a grouping that differs from my original clusters, which is not ideal.
Thank you in advance for the help, and apologies for any conceptual errors, as I am rather new to NLP.
My approach would be to split your TF-IDF document-term matrix by assigned cluster and then, within each cluster, sum the TF-IDF scores of each term (essentially summing that cluster's rows together). The terms with the highest summed scores are the top words for each cluster.
If we assume that 'dtm' is your document-term matrix, 'features' is a list of your terms in the order of the 'dtm' columns, and 'clusters' is your list of cluster labels in the same order as the rows of 'dtm', then this should give you the top words for each cluster:
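import pandas as pd

def top_terms(df, top_n=5):
    # Sum each term's TF-IDF weight over the rows passed in, then keep the top_n largest
    return df.sum().sort_values(ascending=False).head(top_n)

# If dtm is a SciPy sparse matrix, use dtm.toarray() here
df = pd.DataFrame(dtm, columns=features)
df['labels'] = clusters
# Select only the term columns so the 'labels' column is not summed as if it were a term
df.groupby('labels')[list(features)].apply(top_terms, top_n=10)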
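
To make the idea concrete, here is a minimal self-contained sketch of the same per-cluster summation on made-up data. The terms, weights, and cluster labels are hypothetical and only stand in for a real TF-IDF matrix and real cluster assignments; it also groups by the label array directly rather than adding a 'labels' column, which is just one possible variant.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 documents, 4 terms, 2 clusters.
features = ["cat", "dog", "stock", "market"]
dtm = np.array([
    [0.9, 0.4, 0.0, 0.0],
    [0.5, 0.8, 0.0, 0.1],
    [0.0, 0.0, 0.7, 0.6],
    [0.1, 0.0, 0.9, 0.3],
])
clusters = [0, 0, 1, 1]  # cluster label of each row of dtm

df = pd.DataFrame(dtm, columns=features)

# Sum the TF-IDF weights of each term within each cluster
# (result: one row per cluster, one column per term).
cluster_totals = df.groupby(np.asarray(clusters)).sum()

# For each cluster, keep the names of the top_n highest-scoring terms.
top_n = 2
top_words = {
    label: row.nlargest(top_n).index.tolist()
    for label, row in cluster_totals.iterrows()
}
print(top_words)
```

With these made-up weights, the first cluster is dominated by "cat" and "dog" and the second by "stock" and "market". With around 200 clusters, reading the top 10 or so terms per cluster is usually enough to give each one a rough label by hand.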