I am using org.apache.spark.ml.clustering.LDA for topic modelling (with the online optimizer), and it returns an org.apache.spark.ml.clustering.LocalLDAModel. However, there doesn't seem to be any way to get the topic distribution over documents from this model, while the older mllib API (org.apache.spark.mllib.clustering.LocalLDAModel) has a method for exactly that: org.apache.spark.mllib.clustering.LocalLDAModel.topicDistributions(..).

I am not sure why this is so, especially given that the new ml.LDA uses the older mllib.LDA internally and wraps the older mllib.LocalLDAModel in the new ml.LocalLDAModel.
So, can someone please clarify:
1. Why is this so?
2. What is the correct way, if any, to get topic distributions from the new ml.LocalLDAModel?
P.S. I can always modify the Spark code to expose the old API, but I am not sure why it was hidden in the first place.
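For reference, this is roughly the old mllib route I mean; a minimal sketch, assuming corpus is an RDD[(Long, Vector)] of (document id, term-count vector) pairs:

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: RDD[(Long, Vector)] of (document id, term-count vector) pairs
val ldaModel = new LDA()
  .setK(10)
  .setOptimizer(new OnlineLDAOptimizer())
  .run(corpus)
  .asInstanceOf[LocalLDAModel] // the online optimizer yields a LocalLDAModel

// per-document topic mixtures: RDD of (document id, topic distribution)
val topicDistributions: RDD[(Long, Vector)] = ldaModel.topicDistributions(corpus)
```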
Consider ldaModel.transform(dataset), which extends your dataset with an additional column topicDistribution that contains exactly what you want (dataset is the dataset that you passed to the fit() method of your LDAModel instance).
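A minimal sketch of that, assuming dataset is a DataFrame with a "features" column of org.apache.spark.ml.linalg.Vector term counts (e.g. produced by CountVectorizer); the names lda, model, and transformed are placeholders:

```scala
import org.apache.spark.ml.clustering.LDA

// fit the model on a DataFrame with a "features" column of term-count vectors
val lda = new LDA()
  .setK(10)
  .setMaxIter(20)
  .setOptimizer("online")
val model = lda.fit(dataset) // LocalLDAModel when using the online optimizer

// transform() appends a "topicDistribution" column: one Vector of
// per-topic proportions for each document
val transformed = model.transform(dataset)
transformed.select("topicDistribution").show(truncate = false)
```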