Tags: java, topic-modeling, mallet

Extracting keywords from relevant topics using a trained MALLET Topic model


I'm attempting to use MALLET's TopicInferencer to infer keywords from arbitrary text using a trained model. So far my overall approach is as follows.

  • Train a ParallelTopicModel with a large set of known training data to create a collection of topics. I'm currently using a training file with 250,000 lines to create 5,000 topics.
  • Create an InstanceList from arbitrary text not in the trained model.
  • Use the trained model's topicInferencer.getSampledDistribution to generate a topic distribution of the unknown text against the model.
  • Sort the returned distribution and extract the IDs of the top n topics that most closely match the unknown input text.
  • Extract the top keywords from each of the matching topics.

My code is as follows:

Generating the ParallelTopicModel

InstanceList instanceList = new InstanceList(makeSerialPipeList());
instanceList.addThruPipe(new SimpleFileLineIterator(trainingFile)); //training file with one entry per line (around 250,000 lines)

//should train a model with the end result being 5000 topics each with a collection of words
ParallelTopicModel parallelTopicModel = new ParallelTopicModel(
    5000, //number of topics, I think with a large sample size we should want a large collection of topics
    1.0D, //todo: alphaSum, really not sure what this does
    0.01D //todo: beta, really not sure what this does
);
parallelTopicModel.setOptimizeInterval(20); //todo: read about this
parallelTopicModel.addInstances(instanceList);
parallelTopicModel.setNumIterations(2000);
parallelTopicModel.estimate();

My first group of questions are related to creating the ParallelTopicModel.

Since I'm using a fairly large training file, I assume I want a large number of topics. My logic here is that the larger the number of topics, the more closely the inferred keywords will match the arbitrary input text.

I'm also unsure how the alphaSum and beta values and the number of iterations will affect the generated model.

On the inference side, I'm using the ParallelTopicModel to create an inferred topic distribution.

TopicInferencer topicInferencer = parallelTopicModel.getInferencer();
String document = //arbitrary text not in trained model
//following the format I found in SimpleFileLineIterator to create an Instance out of a document
Instance instance = new Instance(document, null, new URI("array:" + 1), null);
InstanceList instanceList = new InstanceList(serialPipes); //same SerialPipes used to create the InstanceList used for the ParallelTopicModel
instanceList.addThruPipe(instance);

//this should return the array of topicIDs and the match value
//[topicId] = 0.5 //match value
double[] topicDistribution = topicInferencer.getSampledDistribution(
    instanceList.get(0), //extract text
    2000, //same iteration count used in the model
    1, //todo: thinning, not sure what this does
    5 //todo: burnIn, not sure what this does
);

//returns a sorted list of the top 5 topic IDs
//this should be the index of the largest values in the returned topicDistribution
List<Integer> topIndexes = topIndexes(topicDistribution, 5); //top 5 topic indexes

//list topics and sorted keywords
ArrayList<TreeSet<IDSorter>> sortedWords = parallelTopicModel.getSortedWords();

//loop over the top indexes
topIndexes.forEach(index -> {
    IDSorter idSorter = sortedWords.get(index).first(); //should hopefully be the first keyword in each topic
    //not sure what alphabet I should use here or if it really matters?
    //I passed in the alphabet from the original instance list as well as the one contained on our model
    Object result = parallelTopicModel.getAlphabet().lookupObject(idSorter.getID());
    double weight = idSorter.getWeight();

    String formattedResult = String.format("%s:%.0f", result, weight);
    //I should now have a relevant keyword and a weight in my result
});
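
The topIndexes helper used above isn't shown. Assuming it simply returns the indices of the n largest entries of the distribution in descending order of weight (which is what the surrounding code expects), one minimal sketch in plain Java would be:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TopIndexes {
    // Returns the indices of the n largest values in distribution,
    // ordered by descending value (ties broken by index order).
    static List<Integer> topIndexes(double[] distribution, int n) {
        return IntStream.range(0, distribution.length)
            .boxed()
            .sorted(Comparator.comparingDouble((Integer i) -> distribution[i]).reversed())
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[] dist = {0.1, 0.4, 0.05, 0.3, 0.15};
        System.out.println(topIndexes(dist, 3)); // [1, 3, 4]
    }
}
```

The returned indices can then be used directly against parallelTopicModel.getSortedWords(), since topic IDs are just positions in the distribution array.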

I have a similar set of questions here. First, I'm not entirely sure whether this overall approach is even correct.

I'm also not sure which Alphabet I should be using: the one from the InstanceList used to generate the ParallelTopicModel, or the one obtained directly from the ParallelTopicModel.

I know this is a fairly involved question but any insight would be greatly appreciated!


Solution

  • alphaSum and beta: These control how concentrated you want your estimated doc-topic and topic-word distributions. Smaller values encourage more concentrated distributions, and less movement. Larger values encourage flatter, more uniform distributions, with more movement. In physics terms, think "high energy" vs. "low energy".

    1.0 for alphaSum is on the low end, 5.0 might be safer. 0.01 for beta is almost always fine. It doesn't matter a lot because you're using hyperparameter optimization (setOptimizeInterval). This will fit the alpha values based on the topics estimated. You might also want to set the burn-in period to something smaller than the default, like 25. This is the number of sweeps through the data before you start optimizing.
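
One way to see why 1.0 is on the low end: MALLET spreads alphaSum evenly across topics to get the symmetric per-topic prior, so with many topics the per-topic alpha becomes tiny. A quick sanity check with plain arithmetic (no MALLET needed; the division follows how ParallelTopicModel initializes its symmetric prior):

```java
public class AlphaCheck {
    public static void main(String[] args) {
        int numTopics = 5000;
        double alphaSum = 1.0;      // the value used in the question
        System.out.println(alphaSum / numTopics); // 2.0E-4, an extremely peaked prior

        double saferSum = 5.0;      // suggested starting point
        int fewerTopics = 500;      // suggested topic count
        System.out.println(saferSum / fewerTopics); // 0.01
    }
}
```

With hyperparameter optimization turned on, these starting values mostly affect the early sweeps; the fitted asymmetric alphas take over afterwards.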

    Number of topics: 5000 is a lot even for 250k text segments. I'd start with 500. With hyperparameter optimization you will get a few large, general topics and lots of small, specific topics. My guess is that with 5000 topics a large number of them -- at least half -- will be essentially empty. That's good in one way, because it means the model is adaptively choosing its topic limit. But it will also mean that you have topics with very little data support that will look like random words. It will also cause problems at inference time: if the model sees a document that doesn't really fit any "real" topic it can put it into an empty topic.

    Inference: 2000 iterations is way too much for inferring topic distributions for new documents. 10 should be enough. Sampling topics for one document given a fixed, already-learned model is much easier than learning a model from scratch. The resulting distribution is the average over several sampling states: the first burnIn states are discarded, and after that only every thinning-th state is saved and included in the average.
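
To make burnIn and thinning concrete: the inferencer runs numIterations Gibbs sweeps over the document, discards the first burnIn states, and then averages a state every thinning sweeps. The exact bookkeeping inside MALLET may differ slightly (e.g., 0- vs. 1-based counting), but this toy illustration shows which iterations would contribute to the averaged distribution:

```java
import java.util.ArrayList;
import java.util.List;

public class SamplingSchedule {
    // Which Gibbs iterations contribute to the averaged distribution,
    // given the (numIterations, thinning, burnIn) parameters of
    // getSampledDistribution. Illustrative only, not MALLET's actual code.
    static List<Integer> savedIterations(int numIterations, int thinning, int burnIn) {
        List<Integer> saved = new ArrayList<>();
        for (int i = 0; i < numIterations; i++) {
            if (i >= burnIn && (i - burnIn) % thinning == 0) {
                saved.add(i);
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        // With 10 iterations, thinning 1, burn-in 5, the last five states are averaged:
        System.out.println(savedIterations(10, 1, 5)); // [5, 6, 7, 8, 9]
        // With thinning 2, only every other post-burn-in state is kept:
        System.out.println(savedIterations(10, 2, 3)); // [3, 5, 7, 9]
    }
}
```

So the question's call with (2000, 1, 5) averages roughly 1995 states, which is far more work than needed for a single fixed-model document.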

    The alphabet shouldn't matter. You can assume that the same id will result in the same string. If the training alphabet and the testing alphabet aren't compatible, inference wouldn't work.