Tags: python, dictionary, gensim, lda, topic-modeling

Access dictionary in Python gensim topic model


I would like to know how to access the dictionary of a gensim LDA topic model. This is particularly important when you train an LDA model, save it, and load it again later. In other words, suppose lda_model is a model trained on a collection of documents. To get the document-topic matrix for new data, one can do something like the code below, or follow the approach explained in https://www.kdnuggets.com/2019/09/overview-topics-extraction-python-latent-dirichlet-allocation.html:

import re
from gensim.corpora.dictionary import Dictionary

WORD = re.compile(r'\w+')

def regTokenize(text):
    # tokenize the text into words
    return WORD.findall(text)

# text: the new documents (a list of strings); lda_model: the trained model
ttext = [regTokenize(d) for d in text]
dic = Dictionary(ttext)  # builds a NEW dictionary from the new data
bows = [dic.doc2bow(doc) for doc in ttext]
doc_topics = lda_model.get_document_topics(bows)

However, the dictionary used to train lda_model may differ from the one built from the new data, in which case the last line raises an error such as:

"IndexError: index 41021 is out of bounds for axis 1 with size 41021"

Is there any way (or parameter) to obtain the dictionary from the trained lda_model, so that it can be used instead of dic = Dictionary(ttext)? Any help would be much appreciated!


Solution

  • The general approach is to store the dictionary created while training the model in a file using the Dictionary.save method, and to read it back for reuse using Dictionary.load.

    Only then does Dictionary.token2id remain the same, so that it can be used to map ids to words and vice versa for a pretrained model.