Tags: python, pandas, dataframe, lda, topic-modeling

Pandas: LDA Top n keywords and topics with weights


I am doing a topic modelling task with LDA, and I am getting 10 components with 15 top words each:

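# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
for index, topic in enumerate(lda.components_):
    print(f'Top 10 words for Topic #{index}')
    # argsort is ascending, so the last 10 indices are the 10 highest-weighted words
    print([feature_names[i] for i in topic.argsort()[-10:]])
    print('\n')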

prints:

Top 10 words for Topic #0
['compile', 'describes', 'info', 'extent', 'changing', 'reader', 'reservation', 'countries', 'printed', 'clear', 'line', 'passwords', 'situation', 'tables', 'downloads']

Now I would like to create a pandas dataframe to show each topic (index) with all the keywords (rows) and see their weights. I'd like the keywords not present in a topic to have 0 weight, but I can't get it to work. I have this so far, but it prints all the feature names (around 1700). How can I restrict it to the top 10 for each topic?

import pandas as pd

topicnames = ['Topic' + str(i) for i in range(lda.n_components)]
# Topic-Keyword Matrix (one row per topic, one column per feature)
df_topic_keywords = pd.DataFrame(lda.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names_out()
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()

Solution

  • If I understand correctly, you have a dataframe with all the values, and you want to keep the top 10 in each row and set the remaining values to 0.

    Here we transform each row by:

    • getting the 10 highest values
    • reindexing to the original index of the row (that is, the columns of the dataframe) and filling the missing entries with 0s:
    >>> df.transform(lambda s: s.nlargest(10).reindex(s.index, fill_value=0), axis='columns')
        a  b   c   d   e  f   g   h   i   j   k  l   m   n   o   p  q   r  s   t   u   v  w  x  y
    a   0  0  63  98   0  0  73   0  78   0  94  0   0  63  68  98  0   0  0  67   0  77  0  0  0
    z  76  0   0   0  84  0  62  61   0  93   0  0  82  70   0   0  0  91  0   0  48  95  0  0  0
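
To connect this back to the question's topic-keyword matrix, here is a minimal sketch of the same idea applied end to end. It is not taken from the original answer: the random `components` array and the `word0`..`word7` names are hypothetical stand-ins for `lda.components_` and `vectorizer.get_feature_names_out()`, and `keep_top_n` is an illustrative helper. It also drops the keywords that are in no topic's top n, which may help with the "prints all ~1700 feature names" problem.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for lda.components_: 3 topics x 8 words.
rng = np.random.default_rng(0)
components = rng.random((3, 8))
# Hypothetical stand-in for vectorizer.get_feature_names_out().
feature_names = [f"word{i}" for i in range(components.shape[1])]
topicnames = [f"Topic{i}" for i in range(components.shape[0])]

df_topic_keywords = pd.DataFrame(components, index=topicnames, columns=feature_names)

top_n = 3  # use 10 for the case in the question


def keep_top_n(row, n=top_n):
    """Keep the n largest weights of a row and set every other entry to 0."""
    return row.nlargest(n).reindex(row.index, fill_value=0)


# Apply the helper to each row (each topic).
df_top = df_topic_keywords.apply(keep_top_n, axis="columns")

# Drop keywords that are not in the top n of any topic.
df_top = df_top.loc[:, (df_top != 0).any(axis=0)]

# Transpose so that keywords are rows and topics are columns.
df_keywords_by_topic = df_top.T
print(df_keywords_by_topic)
```

After this, each topic column has exactly `top_n` non-zero weights, and the number of rows is at most `top_n` times the number of topics rather than the full vocabulary size.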