java, topic-modeling, mallet

MALLET Java: get the probability distribution of a document collection


I would like to get a single probability distribution for a collection of documents, because I need to compute the KL divergence. Is this possible?

In this example: http://mallet.cs.umass.edu/topics-devel.php, the method getTopicProbabilities() gives me the probability distribution of each instance. But what if I want a single distribution for a whole collection of documents?

Could this be the topic distribution of the documents?

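  // Infers the topic distribution of a single held-out instance, testing.get(0).
  TopicInferencer inferencer = model.getInferencer();
  // Arguments: instance, numIterations = 10, thinning = 1, burnIn = 5
  double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);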

Solution

  • I think you can average the per-topic probabilities over the set of documents. This only makes sense when the documents are similar, though. You could cluster the documents using a similarity threshold and average within each cluster.
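
The averaging idea above could be sketched roughly as follows. This is only an illustration, not MALLET code: the class TopicDistributions and its methods average and klDivergence are hypothetical helpers written for this answer, and the sketch assumes you have already collected one double[] per document (all of the same length, each summing to 1).

```java
/** Hypothetical helper, not part of MALLET. */
final class TopicDistributions {

    /** Element-wise mean of per-document topic distributions (each document weighted equally). */
    static double[] average(double[][] perDocument) {
        if (perDocument.length == 0) {
            throw new IllegalArgumentException("need at least one document");
        }
        int numTopics = perDocument[0].length;
        double[] mean = new double[numTopics];
        for (double[] doc : perDocument) {
            if (doc.length != numTopics) {
                throw new IllegalArgumentException("all distributions must have the same length");
            }
            for (int t = 0; t < numTopics; t++) {
                mean[t] += doc[t];
            }
        }
        for (int t = 0; t < numTopics; t++) {
            mean[t] /= perDocument.length;
        }
        return mean;
    }

    /** KL(p || q) in nats; returns +Infinity if q is zero where p is positive. */
    static double klDivergence(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("distributions must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] == 0.0) {
                continue; // 0 * log(0 / q) is taken to be 0
            }
            if (q[i] == 0.0) {
                return Double.POSITIVE_INFINITY;
            }
            sum += p[i] * Math.log(p[i] / q[i]);
        }
        return sum;
    }
}
```

To use it, you would fill one row per document, for example from model.getTopicProbabilities(i) for training instances or from inferencer.getSampledDistribution(...) for new ones, then call average once per collection and klDivergence on the two results. Note that a plain mean weights every document equally regardless of its length, and that KL divergence is not symmetric, so the order of the arguments matters.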