gensim LDA training


I am working with a gensim LDA model for a project, and I can't seem to find a suitable number of topics. Just to be sure: every time I train the model it restarts from scratch, right? For example, I try 47 topics and get terrible results, so I go back to the cell, change 47 to 80 topics, and run it again. That starts a completely new training run and discards what the model learned with 47 topics, right?

I am getting terrible results with LDA: similarity comes out at either 100% or 0%, and I am having trouble tuning the parameters. LSI has given me excellent results. Thanks!


Solution

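  • Yes. Each time you construct and train a new LDA model (for example, after changing the number of topics from 47 to 80 and re-running the cell), training starts from scratch and nothing learned in the previous run carries over.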

    Some suggestions and comments that may help you to get better results:

    • Make sure that you've preprocessed the text appropriately. This usually includes removing punctuation and numbers, removing stopwords and words that are too frequent or too rare, and (optionally) lemmatizing the text. The right preprocessing depends on the language and the domain of the texts.
    • For the hyperparameters, you can use the "auto" mode for alpha and beta (named eta in gensim), which lets the model learn these priors from the data. If you want to fix them, values lower than 1 are usually suggested.
    • LDA is a probabilistic model with random initialization, so re-training it with the same hyperparameters generally gives different results each time unless you fix the random seed (the random_state parameter in gensim).
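
To make the preprocessing suggestion concrete, here is a minimal, standard-library-only sketch. The stopword list, the helper names (tokenize, filter_by_document_frequency), the thresholds, and the sample sentences are all assumptions made up for illustration, and lemmatization is left out because it needs an external library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real project would use a
# fuller, language-specific one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def tokenize(text):
    """Lowercase, keep alphabetic runs only (drops punctuation and numbers),
    then remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def filter_by_document_frequency(docs, min_docs=2, max_fraction=0.9):
    """Drop tokens found in fewer than min_docs documents or in more than
    max_fraction of all documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = max_fraction * len(docs)
    keep = {t for t, n in doc_freq.items() if min_docs <= n <= max_docs}
    return [[t for t in doc if t in keep] for doc in docs]


raw_docs = [
    "The cat sat on the mat in 2020.",
    "A dog chased the cat!",
    "The dog and the bird sat quietly.",
    "The cat watched the bird.",
]
tokenized = [tokenize(d) for d in raw_docs]
filtered = filter_by_document_frequency(tokenized, min_docs=2, max_fraction=0.7)
print(filtered)
```

In gensim itself, Dictionary.filter_extremes (with its no_below and no_above arguments) does a similar frequency-based filtering on the vocabulary, so a hand-written filter like this is mainly useful for seeing what the step does.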