Tags: nlp, cluster-analysis, lda, topic-modeling

How to extract topics from existing text clusters?


I have performed hard clustering (using tf-idf weights) on a text corpus and obtained roughly 200 clusters. How do I obtain the topic of each cluster?

I have tried running LDA on the raw text corpus (before clustering) and obtained many topics, but I am unsure how to map these topics onto each of my existing clusters. Is LDA the right approach, and if so, how do I proceed with it? Or is there another method you would recommend?

The online material I have found only shows how to map LDA topics onto document sentences, not onto pre-existing clusters. If I do that and then group the sentences by their assigned topics, I will get a grouping that differs from my original clusters, which is not ideal.

Thank you in advance for the help, and pardon any conceptual errors, as I am rather new to NLP.


Solution

  • My approach would be to split your tf-idf document-term matrix by assigned cluster and then sum the tf-idf scores of each term within a cluster (essentially summing all of that cluster's rows together). Ranking the terms by these sums gives you the top words for each cluster.

    Assume that dtm is your document-term matrix, features is a list of your terms in the same order as the dtm columns, and clusters is your list of cluster labels in the same order as the dtm rows. Then this should give you the top words for each cluster:

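    import numpy as np
    import pandas as pd

    def top_terms(group, top_n=5):
        # Sum each term's tf-idf score over the cluster's documents and keep the largest.
        return group.sum().sort_values(ascending=False).head(top_n)

    # If dtm is a SciPy sparse matrix, convert it first with dtm.toarray().
    df = pd.DataFrame(dtm, columns=features)

    # Group the rows by cluster label without adding a label column,
    # so the label values are never summed and ranked as if they were a term.
    top_words = df.groupby(np.asarray(clusters)).apply(top_terms, top_n=10)
    print(top_words)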
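
To make the idea concrete, here is a minimal, self-contained sketch of the same summing approach on a toy matrix. The data and the helper name top_terms_per_cluster are made up for illustration; in practice dtm, features and clusters would come from your own vectorizer and clustering step.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions for illustration, not the question's data).
features = ["cat", "dog", "pet", "stock", "market", "trade"]
dtm = np.array([
    [0.8, 0.3, 0.5, 0.0, 0.0, 0.0],  # doc 0
    [0.2, 0.9, 0.4, 0.0, 0.0, 0.0],  # doc 1
    [0.0, 0.0, 0.0, 0.7, 0.6, 0.1],  # doc 2
    [0.0, 0.0, 0.1, 0.3, 0.8, 0.5],  # doc 3
])
clusters = [0, 0, 1, 1]  # one hard cluster label per row of dtm


def top_terms_per_cluster(dtm, features, clusters, top_n=3):
    """Return {cluster label: list of the top_n terms by summed tf-idf score}."""
    df = pd.DataFrame(dtm, columns=features)
    # One row per cluster, holding each term's score summed over that cluster's documents.
    sums = df.groupby(np.asarray(clusters)).sum()
    return {
        label: row.sort_values(ascending=False).head(top_n).index.tolist()
        for label, row in sums.iterrows()
    }


result = top_terms_per_cluster(dtm, features, clusters)
print(result)
```

On this toy input, cluster 0 is described by dog, cat, pet and cluster 1 by market, stock, trade. Because the labels come straight from your existing clustering, the clusters themselves are left unchanged; the terms only describe them.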