I have a working topic model called model, with the following settings:
model = LdaModel(corpus=corpus,
This model is trained on data_words, a list of lists, each containing a single string of text, such as:
data_words = [['This is the first text'],['This is another text'], ['Here is the very last text']]
In this case, len(data_words) is three, but with my actual data it's around 4000.
Based on the trained topic model, I would like to represent each of my 4000 documents from data_words as a topic probability distribution. For each document, this would be a num_topics-dimensional vector in which each cell holds the probability that the corresponding topic is represented in that document.
Following the documentation, I have taken the following steps:
from gensim.corpora.dictionary import Dictionary
common_dictionary = Dictionary(data_words)
common_corpus = [common_dictionary.doc2bow(text) for text in data_words]
And to get the distribution, I ran:
The output here is a tuple, of which the first element, model[common_corpus[0]][0], looks as follows:
[(0, 0.26094702),
(1, 0.29876992),
(3, 0.3244001),
(7, 0.045543537),
(8, 0.03196496),
(9, 0.031232798)]
Is it correct that this is the topic distribution for the first document, and that the probabilities of topics 2, 4, 5 and 6 are equal to zero? Or should I interpret this differently?
Ultimately, I would like to have a 4000 x num_topics matrix in which each cell represents the probability of a topic in a document. Assuming model[common_corpus[0]][0] is what I suspect it is, I could write a function to obtain that matrix from model[common_corpus[i]][0] for each document i. Are there quicker ways to obtain this matrix though?
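According to https://radimrehurek.com/gensim/models/ldamodel.html, all topics with a probability lower than the parameter minimum_probability are discarded (default: 0.01). So yes, this is the topic distribution of the first document, but topics 2, 4, 5 and 6 are not necessarily zero: they were filtered out because they fall below that threshold. If you set minimum_probability=0, you will get the whole topic probability distribution of the document (as a list of (topic_id, probability) tuples).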
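To illustrate with the numbers from the question: the six listed probabilities sum to roughly 0.993, so the remaining mass of about 0.007 is shared by the topics that were filtered out. Below is a minimal sketch in plain Python. It assumes num_topics is 10, which the question does not state, and to_dense is a hypothetical helper name, not part of gensim.

```python
# Sparse output copied from the question: (topic_id, probability) pairs.
sparse = [(0, 0.26094702), (1, 0.29876992), (3, 0.3244001),
          (7, 0.045543537), (8, 0.03196496), (9, 0.031232798)]

NUM_TOPICS = 10  # assumption: the question does not state num_topics


def to_dense(sparse_topics, num_topics):
    """Expand (topic_id, probability) pairs into a fixed-length list.

    Topics absent from the sparse output are filled with 0.0. That is only
    an approximation: they were filtered out, not estimated as exactly zero.
    """
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense


dense = to_dense(sparse, NUM_TOPICS)
listed_mass = sum(prob for _, prob in sparse)
# Probability mass shared by the topics that were filtered out.
missing_mass = 1.0 - listed_mass
missing_topics = [t for t, p in enumerate(dense) if p == 0.0]
```

Here missing_topics contains the topic ids that did not appear in the output, and missing_mass is the small amount of probability they share between them.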
As for your second question, I believe the approach above is the only way to obtain the document-topic distribution. So you need to iterate over all the documents in your dataset and collect the results into the document-topic matrix.
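A minimal sketch of that loop, assuming model is the trained LdaModel and the corpus is a list of bag-of-words documents such as common_corpus from the question. It calls get_document_topics directly with minimum_probability=0.0, which should return just the list of (topic_id, probability) pairs for one document. doc_topic_matrix is a hypothetical helper name. Each row starts out as zeros, so a topic that is still omitted from the output simply stays at 0.0.

```python
def doc_topic_matrix(model, corpus, num_topics):
    """Return a len(corpus) x num_topics list of lists of topic probabilities.

    `model` is expected to expose get_document_topics(bow, minimum_probability=...)
    as gensim's LdaModel does; `corpus` is an iterable of bag-of-words documents.
    """
    matrix = []
    for bow in corpus:
        # Start from zeros so that any topic missing from the output stays 0.0.
        row = [0.0] * num_topics
        # minimum_probability=0.0 asks the model not to drop low-probability topics.
        for topic_id, prob in model.get_document_topics(bow, minimum_probability=0.0):
            row[topic_id] = float(prob)
        matrix.append(row)
    return matrix
```

With the names from the question, this would be called as doc_topic_matrix(model, common_corpus, model.num_topics); the result can be passed to numpy.array if an array is preferred over nested lists.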