apache-spark  nlp  apache-spark-ml  apache-spark-dataset

Get Automatic Topic Labels from LDA topic model in Apache Spark


I am doing topic modeling in Apache Spark to classify certain products from unstructured data.

So far I have applied topic modeling (LDA) and obtained the topics, but I was wondering whether there is any way to automatically infer topic labels from the topics LDA returns.


Solution

  • LDA represents each topic as a probability distribution over the terms in the dictionary. If you call describeTopics(n) on your LDAModel, you receive a DataFrame that lists, for each topic, the indices of its n highest-weighted terms together with their weights.

    If you need to infer topic labels, I assume you want the human-readable terms that best represent each topic. The LDAModel does not give you these directly, because it only knows term indices. Instead, you need to call describeTopics on it and then map each term index back to the corresponding term in your dictionary.
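
The index-to-term mapping described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not a definitive implementation: the helper name label_topics, the top_k parameter, and the sample rows and vocabulary are all made up for the example. It assumes each row has the shape (topic, termIndices, termWeights), which is what the rows of the describeTopics DataFrame look like once collected, and that the dictionary is available as a list indexed by term id.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# One row of describeTopics output: (topic, termIndices, termWeights)
TopicRow = Tuple[int, Sequence[int], Sequence[float]]


def label_topics(
    topic_rows: Iterable[TopicRow],
    vocabulary: Sequence[str],
    top_k: int = 3,
) -> Dict[int, List[str]]:
    """Hypothetical helper: turn term indices into readable terms per topic."""
    labels: Dict[int, List[str]] = {}
    for topic, term_indices, term_weights in topic_rows:
        # Pair every term index with its weight and rank by weight, highest first.
        ranked = sorted(
            zip(term_indices, term_weights),
            key=lambda pair: pair[1],
            reverse=True,
        )
        # Look up each of the top_k indices in the dictionary.
        labels[topic] = [vocabulary[index] for index, _ in ranked[:top_k]]
    return labels


# Made-up stand-in for the collected describeTopics rows.
rows = [
    (0, [2, 0, 4], [0.30, 0.20, 0.10]),
    (1, [1, 3, 5], [0.25, 0.15, 0.05]),
]
# Made-up dictionary: position in the list is the term index.
vocab = ["phone", "shirt", "battery", "cotton", "charger", "sleeve"]

labels = label_topics(rows, vocab, top_k=2)
for topic, terms in labels.items():
    print(topic, terms)
```

With PySpark, and assuming the dictionary came from a CountVectorizer, the inputs would likely be something like rows = lda_model.describeTopics(5).collect() and vocab = cv_model.vocabulary, where lda_model and cv_model are placeholder names for your fitted models. The resulting term lists are only candidate labels; picking a single name per topic still takes a human or an additional heuristic.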