I have a list of sentences, and I followed the tutorial's instructions to build a corpus from it:
texts = [[word for word in document.lower().split() if word.isalpha()] for document in documents]
corpus = corpora.Dictionary(texts)
I want to train an LDA model on this corpus and extract the topic keywords.
lda = models.LdaModel(corpus, num_topics=10)
However, I get an error during training: TypeError: 'int' object is not iterable. What am I doing wrong? What format should the corpus be in?
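corpora.Dictionary only builds a mapping between words and integer ids; it is not a corpus itself. LdaModel expects an iterable of documents, each given as a list of (token_id, count) pairs, which a Dictionary does not provide, hence the TypeError. After building the dictionary, convert each document with its doc2bow method, which looks up each word's integer id and counts its occurrences (a plain id lookup, not the 'hashing trick'):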
from gensim import corpora, models

texts = [[word for word in document.lower().split() if word.isalpha()] for document in documents]
dictionary = corpora.Dictionary(texts)  # maps each word to an integer id
bow_corpus = [dictionary.doc2bow(text) for text in texts]  # (token_id, count) pairs per document
Then train the model on bow_corpus, not on the dictionary:
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=5)
id2word maps the integer ids back to words, so the topics are reported as words rather than numbers.
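To make the expected corpus format concrete, here is a rough pure-Python sketch of the kind of structure doc2bow produces. It does not use gensim: build_token2id and to_bow are hypothetical helpers written for illustration, the sample documents are made up, and the ids gensim assigns to words may differ from the first-seen order used here.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer id to every distinct token, in first-seen order.
    (Hypothetical stand-in for the word-to-id mapping a Dictionary holds.)"""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs.
    (Hypothetical stand-in for doc2bow; unknown tokens are skipped.)"""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# Made-up sample documents, tokenized the same way as in the question.
documents = ["Human machine interface machine", "Machine learning for human users"]
texts = [[word for word in document.lower().split() if word.isalpha()] for document in documents]

token2id = build_token2id(texts)
bow_corpus = [to_bow(text, token2id) for text in texts]

# Reverse mapping from id to word, playing the role of id2word.
id2word = {token_id: token for token, token_id in token2id.items()}

for doc in bow_corpus:
    print(doc, [(id2word[token_id], count) for token_id, count in doc])
```

Each document becomes a list of (id, count) tuples, and the whole corpus is a list of such lists. That is the shape the model needs to receive, while the word-to-id mapping is passed separately so ids can be turned back into words.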