Tags: python, nlp, lda, topic-modeling

Calculating optimal number of topics for topic modeling (LDA)


I am doing topic modeling with LDA. I ran my commands to find the optimal number of topics, and the output (below) looks quite different from any coherence plot I have seen before. Do you think it is okay, or would it be better to use an algorithm other than LDA? It is worth mentioning that when I visualize the topic keywords for 10 topics, the plot shows 2 main topics while the others overlap almost completely. Also, is there a valid range for coherence values?

Many thanks for sharing your comments, as I am a beginner in topic modeling.

[Image: plot of coherence scores against number of topics]


Solution

  • Shameless self-promotion: I suggest you use the OCTIS library: https://github.com/mind-Lab/octis. It allows you to run different topic models and optimize their hyperparameters (including the number of topics) in order to select the best result.

    There might be many reasons why you get those results, but here are some hints and observations:

    • Make sure that you've preprocessed the text appropriately. This usually includes removing punctuation and numbers, removing stopwords and words that are too frequent or rare, and (optionally) lemmatizing the text. Preprocessing depends on the language and the domain of the texts.
    • LDA is a probabilistic model, which means that if you re-train it with the same hyperparameters, you can get different results each time (unless you fix the random seed). A good practice is to train the model with the same number of topics multiple times and then average the topic coherence.
    • There are a lot of topic models, and LDA usually works fine. The choice of topic model depends on the data that you have. For example, if you are working with tweets (i.e. short texts), I wouldn't recommend using LDA, because it does not handle sparse texts well.
    • Check how you set the hyperparameters. They may have a huge impact on the performance of the topic model.
    • The range for coherence (I assume you used NPMI which is the most well-known) is between -1 and 1, but values very close to the upper and lower bound are quite rare.

    References: https://www.aclweb.org/anthology/2021.eacl-demos.31/
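To make the NPMI bounds mentioned above concrete, here is a tiny self-contained computation; the `npmi` helper is written for illustration, not taken from any library:

```python
import math

def npmi(p_i, p_j, p_ij):
    """Normalized PMI: log(p_ij / (p_i * p_j)) / -log(p_ij)."""
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

# Perfect co-occurrence: the two words always appear together,
# so p_ij == p_i == p_j and NPMI hits its upper bound of 1.
print(npmi(0.1, 0.1, 0.1))

# Independence: the joint probability equals the product of the
# marginals, so PMI (and hence NPMI) is 0.
print(npmi(0.1, 0.1, 0.01))
```

The value -1 is only approached when two words (almost) never co-occur, which is why coherence scores near either bound are rare in practice.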