Tags: apache-spark, pyspark, apache-spark-mllib, lda, apache-spark-ml

LDA model prediction inconsistency


I trained an LDA model and loaded it into the environment to transform new data:

from pyspark.ml.clustering import LocalLDAModel

lda = LocalLDAModel.load(path)  # load the previously trained model
df = lda.transform(text)        # infer topic distributions for the new documents

The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input; otherwise the model is not consistent. In practice, however, it is not.

Why does this happen, and how can I fix it?


Solution

  • LDA uses randomness when training and, depending on the implementation, when inferring on new data. The implementation in Spark is based on EM MAP inference, so I believe it only uses randomness when training the model. This means that the results will differ each time the algorithm is trained and run.

    To get the same results for the same input and the same parameters, you can set the random seed when training the model. For example, to set the seed to 1:

    from pyspark.ml.clustering import LDA

    model = LDA(k=2, seed=1).fit(data)
    
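    As a quick sanity check, fitting twice with the same seed should learn identical topics. A sketch, assuming data is the same training DataFrame of feature vectors used above:

    import numpy as np
    from pyspark.ml.clustering import LDA

    m1 = LDA(k=2, seed=1).fit(data)
    m2 = LDA(k=2, seed=1).fit(data)
    # same seed and same data should yield the same topic-word matrix
    assert np.array_equal(m1.topicsMatrix().toArray(), m2.topicsMatrix().toArray())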

    In your case, though, the transform step is also randomized: a LocalLDAModel infers each document's topicDistribution with a randomly initialized variational step, which is why the output changes between calls. To make it deterministic, create a parameter map that overrides the default value (None) for seed:

    lda = LocalLDAModel.load(path)
    paramMap = {lda.seed: 1}
    df = lda.transform(text, paramMap)
    
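    With the seed fixed, repeated transforms of the same input now agree. A minimal check, reusing lda, path, and text from the snippet above:

    seeded = {lda.seed: 1}
    first = lda.transform(text, seeded).select("topicDistribution").collect()
    second = lda.transform(text, seeded).select("topicDistribution").collect()
    # identical seed -> identical topicDistribution on every call
    assert first == second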

    For more information about overriding model parameters, see the Parameters section of the Spark ML Pipelines guide.