I am doing a topic modelling task with LDA, and I am getting 10 components with 15 top words each:
for index, topic in enumerate(lda.components_):
    print(f'Top 10 words for Topic #{index}')
    print([vectorizer.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')
prints:
Top 10 words for Topic #0
['compile', 'describes', 'info', 'extent', 'changing', 'reader', 'reservation', 'countries', 'printed', 'clear', 'line', 'passwords', 'situation', 'tables', 'downloads']
Now I would like to create a pandas DataFrame that shows each topic (index) against all the keywords (columns) with their weights. Keywords that are not among a topic's top words should get a weight of 0, but I can't get it to work. This is what I have so far, but it shows the weights for all the feature names (around 1700). How can I keep the weights only for the top 10 keywords of each topic?
topicnames = ['Topic' + str(i) for i in range(lda.n_components)]
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(lda.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names()
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()
If I understand correctly, you have a dataframe with all the values, and you want to keep the top 10 in each row and set the remaining values to 0.
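Here we transform each row by taking its 10 largest values (nlargest) and reindexing back to the original columns with 0 as the fill value: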
>>> df.transform(lambda s: s.nlargest(10).reindex(s.index, fill_value=0), axis='columns')
    a  b   c   d   e  f   g   h   i   j   k  l   m   n   o   p  q   r  s   t   u   v  w  x  y
a   0  0  63  98   0  0  73   0  78   0  94  0   0  63  68  98  0   0  0  67   0  77  0  0  0
z  76  0   0   0  84  0  62  61   0  93   0  0  82  70   0   0  0  91  0   0  48  95  0  0  0
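To tie this back to the topic-keyword matrix, here is a minimal sketch of how the same transform might be applied. It does not fit a real LDA model: `components` is a small random array standing in for `lda.components_`, `feature_names` is a made-up vocabulary standing in for the vectorizer's feature names, and `top_n` is set to 3 only to keep the example small (use 10 for your case). The last step, dropping keywords that are in no topic's top list, is an optional assumption about what you want to display.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for lda.components_: 3 topics x 8 vocabulary terms.
rng = np.random.default_rng(0)
components = rng.random((3, 8))

# Made-up vocabulary standing in for the vectorizer's feature names.
feature_names = [f"word{i}" for i in range(components.shape[1])]
topicnames = [f"Topic{i}" for i in range(components.shape[0])]

# Topic-keyword matrix: one row per topic, one column per keyword.
df_topic_keywords = pd.DataFrame(components, index=topicnames, columns=feature_names)

top_n = 3  # assumption for this small example; the question uses 10

# For each row, keep the top_n largest weights and fill the rest with 0.
df_top = df_topic_keywords.transform(
    lambda s: s.nlargest(top_n).reindex(s.index, fill_value=0),
    axis="columns",
)

# Optional: drop keywords that are not in any topic's top_n,
# so the frame no longer carries every feature name.
df_top = df_top.loc[:, (df_top != 0).any(axis="index")]

print(df_top)
```

With the real model you would pass `lda.components_` and your feature names instead of the stand-ins. After the optional last step, the frame has at most `top_n * n_topics` columns rather than the full vocabulary.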