Tags: python, scikit-learn, lda, topic-modeling

Get most probable words for each topic


I made an LDA model with sklearn but, as strange as it sounds, I cannot find anything online about how to get the top words for each topic. This is my code:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
doc_term_matrix = count_vect.fit_transform(tweet_tp['text'].values.astype('U'))
doc_term_matrix


from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=3, random_state=1)
id_topic = LDA.fit(doc_term_matrix)

Once I added this:

import numpy as np
vocab = count_vect.get_feature_names()

topic_words = {}
for topic, comp in enumerate(LDA.components_):
    word_idx = np.argsort(comp)[::-1][:5]

topic_words[topic] = [vocab[i] for i in word_idx]

for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))

which I found in an answer here, but I can't find it at the moment. However, this outputs only the second topic's words.


Solution

  • You can use a small helper function, ntopwlst, like so:

    from sklearn.feature_extraction.text import CountVectorizer
    
    count_vect = CountVectorizer()
    doc_term_matrix = count_vect.fit_transform(tweet_tp['text'].values.astype('U'))
    
    from sklearn.decomposition import LatentDirichletAllocation
    
    LDA = LatentDirichletAllocation(n_components=3, random_state=1)
    id_topic = LDA.fit(doc_term_matrix)
    
    def ntopwlst(model, features, ntopwords):
        '''create a list of the top topic words'''
        output = []
        for topic_idx, topic in enumerate(model.components_): # compose output message with top words
            output.append(str(topic_idx))
            output += [features[i] for i in topic.argsort()[:-ntopwords - 1:-1]] # [start (0 if omitted): end : slicing increment]
        return output
    
    ntopwords = 5 # change this to show more words per topic (e.g. 20)
    tf_feature_names = count_vect.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
    topwds = ntopwlst(LDA, tf_feature_names, ntopwords)
    

    You did extract the vocabulary, but this is easier than handling the LDA results directly. I was not able to test this since I lack the tweet_tp data, so use with caution.