I am using mallet from a scala project. After training the topic models and got the inferencer file, I tried to assign topics to new texts. The problem is I got different results with different calling methods. Here are the things I tried:
creating a new InstanceList and ingest just one document and get the topic results from the InstanceList
somecontentList.map(text=>getTopics(text, model))
def getTopics(text:String, inferencer: TopicInferencer):Array[Double]={
val testing = new InstanceList(pipe)
testing.addThruPipe(new Instance(text, null, "test instance", null))
inferencer.getSampledDistribution(testing.get(0), iter, 1, burnIn)
}
Put everything in a InstanceList and predict topics together.
val testing = new InstanceList(pipe)
somecontentList.foreach(text=>
testing.addThruPipe(new Instance(text, null, "test instance", null))
)
(0 until testing.size).map(i=>
ldaModel.getSampledDistribution(testing.get(i), 100, 1, 50))
These two methods produce very different results except for the first instance. What is the right way of using the inferencer?
Additional information: I checked the instance data.
0: topic (0)
1: beaten (1)
2: death (2)
3: examples (3)
4: forum (4)
5: wanted (5)
6: contributing (6)
I assume the number in parenthesis is the index of words used in prediction. When I put all text into the InstanceList, the index is different because the collection has more text. Not sure how exactly that information is considered in the model prediction process.
Remember that the new instances must be imported with the pipe from the original data as recorded in the Inferencer
in order for the alphabets to match. It's not clear where pipe
is coming from in the scala code, but the fact that the first six words seem to have what looks like it might be ids starting with 0 suggests that this is a new alphabet.