I am using org.apache.spark.ml.clustering.LDA for topic modelling (with the online optimizer), and it returns an org.apache.spark.ml.clustering.LocalLDAModel. However, there doesn't seem to be any way to get the topic distribution over documents from this model, while the older mllib API (org.apache.spark.mllib.clustering.LocalLDAModel) has a method for exactly that: org.apache.spark.mllib.clustering.LocalLDAModel.topicDistributions(..).

I am not sure why this is so, especially given that the new ml.LDA uses the older mllib.LDA internally and wraps the older mllib.LocalLDAModel in the new ml.LocalLDAModel.
So, can someone please clarify:
1. Why is this so?
2. What is the correct way, if any, to get topic distributions from the new ml.LocalLDAModel?
P.S. I can always modify the Spark code to expose the old API, but I am not sure why it was hidden in the first place.
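For reference, this is roughly the old mllib route I mean; a minimal sketch, assuming corpus is an RDD[(Long, Vector)] of (document id, term-count vector) pairs:

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: RDD[(Long, Vector)] of (document id, term-count vector) pairs
val ldaModel = new LDA()
  .setK(10)
  .setOptimizer(new OnlineLDAOptimizer())
  .run(corpus)
  .asInstanceOf[LocalLDAModel] // the online optimizer yields a LocalLDAModel

// per-document topic mixtures: RDD of (document id, topic distribution)
val topicDistributions: RDD[(Long, Vector)] = ldaModel.topicDistributions(corpus)
```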
Consider ldaModel.transform(dataset), which extends your dataset with an additional column topicDistribution that contains exactly what you want (dataset is the dataset that you passed to the fit() method of your LDAModel instance).
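A minimal sketch of that, assuming dataset is a DataFrame with a "features" column of org.apache.spark.ml.linalg.Vector term counts (e.g. produced by CountVectorizer); the names lda, model, and transformed are placeholders:

```scala
import org.apache.spark.ml.clustering.LDA

// fit the model on a DataFrame with a "features" column of term-count vectors
val lda = new LDA()
  .setK(10)
  .setMaxIter(20)
  .setOptimizer("online")
val model = lda.fit(dataset) // LocalLDAModel when using the online optimizer

// transform() appends a "topicDistribution" column: one Vector of
// per-topic proportions for each document
val transformed = model.transform(dataset)
transformed.select("topicDistribution").show(truncate = false)
```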