Gensim: Unable to train the LDA model


I have a list of sentences, and I followed the instructions in the tutorial to build a corpus from it:

texts = [[word for word in document.lower().split() if word.isalpha()] for document in documents]
corpus = corpora.Dictionary(texts)

I want to train an LDA model on this corpus and extract the topic keywords.

lda = models.LdaModel(corpus, num_topics=10)

However, I receive an error while training: TypeError: 'int' object is not iterable. What am I doing wrong? What should the format of a corpus be?


Solution

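  • corpora.Dictionary only builds a mapping between words and integer ids; it is not a corpus. Iterating over it yields plain integers, which is why LdaModel raises the TypeError. After building the dictionary, convert each document with its doc2bow method, which turns a list of words into a list of (token id, count) pairs (a bag-of-words vector). This is an id lookup, not the 'hashing trick':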

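    from gensim import corpora, models

    texts = [[word for word in document.lower().split() if word.isalpha()] for document in documents]
    corpus = corpora.Dictionary(texts)  # word <-> integer id mapping (despite the name, not a corpus)
    hashed_corpus = [corpus.doc2bow(text) for text in texts]  # one bag-of-words vector per document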
    

    And after that you can train your model with hashed_corpus:

    lda = models.LdaModel(hashed_corpus, id2word=corpus, num_topics=5)
    

    id2word maps the integer token ids back to words, so passing it lets the model report topics as words rather than numbers.
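
To make the expected corpus format concrete, here is a small pure-Python sketch of the data that the dictionary and bag-of-words steps produce. It does not use gensim: `token2id` and `to_bow` are hypothetical stand-ins for the dictionary and for doc2bow, the two sample sentences are made up for illustration, and the ids assigned here will generally differ from the ones gensim picks.

```python
from collections import Counter

# Made-up sample documents, for illustration only.
documents = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
]

# Same tokenization as in the question.
texts = [[word for word in document.lower().split() if word.isalpha()]
         for document in documents]

# Hypothetical stand-in for the dictionary: one integer id per distinct word.
token2id = {}
for text in texts:
    for word in text:
        token2id.setdefault(word, len(token2id))


def to_bow(text):
    """Hypothetical stand-in for doc2bow: sorted (token_id, count) pairs."""
    counts = Counter(token2id[word] for word in text if word in token2id)
    return sorted(counts.items())


# The corpus is a list with one bag-of-words vector per document.
bow_corpus = [to_bow(text) for text in texts]

for bow in bow_corpus:
    print(bow)
```

Each document becomes a list of (token id, count) tuples, and the corpus is a list of those lists. That nested shape is what the training step iterates over; a word that appears twice in a document, such as "of" in the second sentence, shows up once with a count of 2.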