python-3.x vector probability lda

Get topic probability distribution for new document


I have a working topic model called model, with the following settings:

from gensim.models import LdaModel

model = LdaModel(corpus=corpus,
                 id2word=id2word,
                 num_topics=10,
                 random_state=100,
                 update_every=1,
                 chunksize=100,
                 passes=50,
                 alpha='auto',
                 per_word_topics=True)

This model is trained on data_words, a list of lists, each containing a string of text, such as:

data_words = [['This is the first text'],['This is another text'], ['Here is the very last text']]

In this case, len(data_words) is three, but with my actual data it's (around) 4000.

Based on the trained topic model, I would like to represent each of my 4000 documents from data_words as a topic probability distribution. For each document, this would be a num_topics-dimensional vector, with each entry giving the probability of the corresponding topic in that document.

Following the documentation, I have taken the following steps:

from gensim.corpora.dictionary import Dictionary
common_dictionary = Dictionary(data_words)
common_corpus = [common_dictionary.doc2bow(text) for text in data_words]

And to get the distribution, I ran: model[common_corpus[0]]

The output here is a tuple, the first element of which, model[common_corpus[0]][0], looks as follows:

[(0, 0.26094702),
 (1, 0.29876992),
 (3, 0.3244001),
 (7, 0.045543537),
 (8, 0.03196496),
 (9, 0.031232798)]

Is it correct that this is the topic distribution for the first document, and that the probabilities of topics 2, 4, 5 and 6 are equal to zero? Or should I interpret this differently?

Ultimately, I would like to have a 4000 × num_topics matrix in which each cell represents the probability of a topic in a document. Assuming model[common_corpus[0]][0] is what I suspect it is, I could write a function that builds that matrix from model[common_corpus[i]][0] for each document i. Is there a quicker way to obtain this matrix, though?


Solution

  • According to the LdaModel documentation (https://radimrehurek.com/gensim/models/ldamodel.html), all topics whose probability is lower than the minimum_probability parameter (default: 0.01) are discarded from the output. Topics 2, 4, 5 and 6 are therefore not necessarily zero; they only fall below that threshold. If you set minimum_probability=0, you will get the whole topic probability distribution of the document (as a list of (topic_id, probability) tuples).

    As for your second question, I believe the approach above is the only way to obtain the document-topic distribution, so you need to iterate over all the documents in your dataset to build the document-topic matrix.
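
The iteration described above could be sketched roughly as follows. The helper name document_topic_matrix is hypothetical and not part of gensim; it relies only on the model's get_document_topics method and num_topics attribute. It also assumes that every bag-of-words document in corpus was built with the same dictionary the model was trained on (id2word in the question), since a freshly built Dictionary may assign different word ids.

```python
def document_topic_matrix(model, corpus):
    """Return one dense row of topic probabilities per bag-of-words document.

    `model` is assumed to behave like a trained gensim LdaModel, and each
    item of `corpus` is assumed to be a doc2bow result built with the
    model's own dictionary. This helper is illustrative, not a gensim API.
    """
    num_topics = model.num_topics
    matrix = []
    for bow in corpus:
        # Start from zeros so any topic missing from the output stays at 0.0.
        row = [0.0] * num_topics
        # minimum_probability=0 asks gensim not to filter out small topics;
        # per_word_topics=False keeps the return value a plain list of tuples.
        doc_topics = model.get_document_topics(
            bow, minimum_probability=0, per_word_topics=False
        )
        for topic_id, prob in doc_topics:
            row[topic_id] = float(prob)
        matrix.append(row)
    return matrix


# Possible usage with the question's objects (assumes `model`, `id2word`
# and tokenized documents in `data_words` already exist):
# corpus = [id2word.doc2bow(tokens) for tokens in data_words]
# matrix = document_topic_matrix(model, corpus)
```

The result is a list of lists with one row per document and num_topics columns; if NumPy is available, numpy.asarray(matrix) turns it into a 4000 × num_topics array.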