Tags: java, statistics, lda, topic-modeling, mallet

Strange perplexity values of LDA model trained with MALLET


I have trained an LDA model with MALLET on part of the Stack Overflow data dump, using a 70/30 split into training and test data.

But the perplexity values are strange: they are lower for the test set than for the training set. How is this possible? I thought the model would be better fitted to the training data.

I have already double-checked my perplexity calculations, but I cannot find an error. Do you have any idea what the reason could be?

Thank you in advance!

[Image: perplexity / LL-per-token values for the training and test sets]
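For context, a minimal sketch of how such a setup might look with MALLET's Java API, assuming the documents have already been imported into a serialized `InstanceList` (the file name, topic count, and hyperparameters below are illustrative placeholders, not the actual settings used above):

```java
import java.io.File;

import cc.mallet.topics.MarginalProbEstimator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.FeatureSequence;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.util.Randoms;

public class PerplexityExample {
    public static void main(String[] args) throws Exception {
        // Load a previously imported, serialized instance list
        // (the file name is a placeholder).
        InstanceList instances = InstanceList.load(new File("stackoverflow.mallet"));

        // 70/30 split into training and test data.
        InstanceList[] parts = instances.split(new Randoms(), new double[] {0.7, 0.3});
        InstanceList training = parts[0];
        InstanceList test = parts[1];

        // Train the model (topic count and Dirichlet parameters are illustrative).
        ParallelTopicModel model = new ParallelTopicModel(100, 5.0, 0.01);
        model.addInstances(training);
        model.setNumIterations(1000);
        model.estimate();

        // Estimate the marginal log-probability of the held-out words
        // with MALLET's left-to-right evaluator.
        MarginalProbEstimator estimator = model.getProbEstimator();
        double logLikelihood = estimator.evaluateLeftToRight(test, 10, false, null);

        // Perplexity = exp(-LL / number of tokens).
        long tokens = 0;
        for (Instance instance : test) {
            tokens += ((FeatureSequence) instance.getData()).getLength();
        }
        System.out.println("Test perplexity: " + Math.exp(-logLikelihood / tokens));
    }
}
```

`evaluateLeftToRight` estimates the marginal log-probability of the test words; dividing by the token count and exponentiating the negated value gives per-token perplexity.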

Edit:

Instead of using the console output for the LL/token values of the training set, I have now run the evaluator on the training set as well. The values now look plausible.

[Image: re-evaluated LL/token values for the training and test sets]
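The re-evaluation described in the edit might look roughly like this (a sketch continuing the example above; `estimator` and `training` are the objects defined there):

```java
// Re-score the *training* documents with the same left-to-right
// evaluator used for the test set, instead of reading LL/token
// from the training console output. Both values then estimate
// marginal word probabilities and are directly comparable.
double trainLogLikelihood = estimator.evaluateLeftToRight(training, 10, false, null);

long trainTokens = 0;
for (Instance instance : training) {
    trainTokens += ((FeatureSequence) instance.getData()).getLength();
}
System.out.println("Train LL/token: " + trainLogLikelihood / trainTokens);
System.out.println("Train perplexity: " + Math.exp(-trainLogLikelihood / trainTokens));
```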


Solution

  • That makes sense. The LL/token number gives you the joint probability of the topic assignments and the observed words, whereas the held-out probability gives you the marginal probability of just the observed words, summed over all topic assignments.
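In API terms (again a hedged sketch reusing the objects defined earlier), the two quantities come from different calls:

```java
// Joint log-probability log P(words, topic assignments): the
// LL/token lines printed to the console during training are
// based on this quantity.
double jointLL = model.modelLogLikelihood();

// Marginal log-probability log P(words), with the topic
// assignments summed out: this is what the left-to-right
// evaluator estimates.
double marginalLL = estimator.evaluateLeftToRight(training, 10, false, null);
```

Because P(words) sums P(words, assignments) over all assignments, and every term in that sum is non-negative, the marginal is at least as large as the joint probability for any single assignment. Comparing a joint LL/token for the training set against a marginal LL/token for the test set therefore made the test set appear to fit better.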