I'm trying to figure out what the weight assigned to each word in a topic represents in Mallet.
I'm assuming it's some form of document occurrence count. However, I'm having a hard time figuring out how that figure is arrived at.
In my model, several words occur in more than one topic, with a different weight assigned in each, so the number is clearly not the word's count over the entire corpus. My next guess was that it is the number of occurrences of the word in the set of documents assigned to the topic, but when I tried to verify that manually, it also turned out to be incorrect.
As an example: I'm training a model over a corpus of about 12,000 documents (alpha 0.1, beta 0.01, t = 50). After training, my model has the following topic:
t1 = "knoflook (158.0), olie (156.0), ...."
So the word 'knoflook' is assigned a weight of 158. Yet when I manually count the number of documents in my corpus that contain that word and have t1 assigned, I get a completely different number (1855).
It's possible that my manual verification is off, of course, but it would be useful to know, in general, how the word weight in each topic is arrived at.
By the way, the above topic is a rendering based on the following code:
// The data alphabet maps word IDs to strings
Alphabet dataAlphabet = instances.getDataAlphabet();

// Get an array of sorted sets of word ID/count pairs
ArrayList<TreeSet<IDSorter>> topicSortedWords = topicModel.getSortedWords();

for (int t = 0; t < numberOfTopics; t++) {
    Iterator<IDSorter> iterator = topicSortedWords.get(t).iterator();
    StringBuilder sb = new StringBuilder();
    while (iterator.hasNext()) {
        IDSorter idWeightPair = iterator.next();
        final String wordLabel = dataAlphabet.lookupObject(idWeightPair.getID()).toString();
        final double weight = idWeightPair.getWeight();
        sb.append(wordLabel).append(" (").append(weight).append("), ");
    }
    sb.setLength(sb.length() - 2); // drop the trailing ", "
    // sb.toString() is now a human-readable representation of the topic
}
Mallet assigns each word token to a topic. The getSortedWords() method counts how many word tokens are of a particular type (e.g. knoflook) and are also assigned to topic k. How the tokens are divided into documents does not matter for this calculation.
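To make that concrete, here is a minimal sketch that re-derives the figure from the per-token assignments. It assumes the topicModel and dataAlphabet variables from the question, that the model is a cc.mallet.topics.ParallelTopicModel, and that the word string and topic index below are the hypothetical ones from the example; TopicAssignment, FeatureSequence, and LabelSequence come from the cc.mallet.topics and cc.mallet.types packages.

// Sketch: count tokens of one type assigned to one topic, directly from
// the per-token topic assignments stored by the model.
int wordId = dataAlphabet.lookupIndex("knoflook", false); // -1 if the word is not in the alphabet
int topic = 1;                                            // hypothetical index for topic "t1"
int tokenCount = 0;
for (TopicAssignment assignment : topicModel.getData()) {
    FeatureSequence tokens = (FeatureSequence) assignment.instance.getData();
    LabelSequence topics = assignment.topicSequence;
    for (int position = 0; position < tokens.getLength(); position++) {
        if (tokens.getIndexAtPosition(position) == wordId
                && topics.getIndexAtPosition(position) == topic) {
            tokenCount++;
        }
    }
}
// tokenCount should match the weight that getSortedWords() reports
// for this word in this topic.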
If I'm understanding correctly, you're finding that there are 1855 documents that have a word token of type knoflook and also have a word token assigned to topic t1. But there's no guarantee that those two tokens are the same.
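For contrast, here is a sketch (under the same assumptions as above) of that document-level count: documents containing at least one token of type knoflook and at least one token, of any type, assigned to t1. Because those two tokens need not coincide, this figure can be far larger than the topic weight.

// Sketch: the document-level count described above, for contrast.
int docCount = 0;
for (TopicAssignment assignment : topicModel.getData()) {
    FeatureSequence tokens = (FeatureSequence) assignment.instance.getData();
    LabelSequence topics = assignment.topicSequence;
    boolean hasWord = false;
    boolean hasTopic = false;
    for (int position = 0; position < tokens.getLength(); position++) {
        if (tokens.getIndexAtPosition(position) == wordId) { hasWord = true; }
        if (topics.getIndexAtPosition(position) == topic) { hasTopic = true; }
    }
    if (hasWord && hasTopic) { docCount++; }
}
// docCount is the 1855-style figure: it counts co-occurrence within a
// document, not tokens of the word actually assigned to the topic.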
From other work looking at recipes, I would guess that garlic is a common ingredient that occurs in many contexts, and probably has high probability in many topics. It would not be surprising if many instances of the word are assigned to other topics.
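One way to check that is to print the word's weight in every topic, reusing the topicSortedWords structure from the question's code; a sketch, assuming the same wordId as above:

// Sketch: how one word type's tokens are spread across all topics.
// Only topics where the word has been assigned at least once will print.
for (int t = 0; t < numberOfTopics; t++) {
    for (IDSorter idWeightPair : topicSortedWords.get(t)) {
        if (idWeightPair.getID() == wordId) {
            System.out.printf("topic %d: %.1f tokens of 'knoflook'%n", t, idWeightPair.getWeight());
        }
    }
}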