topic-modeling, mallet

ParallelTopicModel - Thread option changes result significantly


I am currently using ParallelTopicModel for topic modeling, but I've encountered some strange behavior. When I set a different number of threads for the model, I get different results, which should not happen if I understand correctly. Our implementation runs on different machines with different maximum thread counts, and the results differ between them. The random seed, documents, number of iterations, etc. are the same.

Is this a known bug or expected behavior? Or am I just doing something wrong?

Code Snippet:

    // Begin by importing documents from text to feature sequences
    final InstanceList instances = new InstanceList(docPipe);
    instances.addThruPipe(docsIter);
    final ParallelTopicModel model =
        new ParallelTopicModel(noOfTopics, m_alpha.getDoubleValue() * noOfTopics, m_beta.getDoubleValue());
    model.setRandomSeed(m_seed.getIntValue());
    model.addInstances(instances);
    model.setNumThreads(noOfThreads);
    model.setNumIterations(noOfIterations);
    // Train the model.
    model.estimate();

Solution

  • Each thread has its own random number generator and is responsible for its own segment of the collection. Setting the seed initializes every one of these generators to the same sequence, so with the same number of threads you should get the same results.

    With a different number of threads, the segments change, so the same random numbers are applied to different tokens. Those tokens have different sampling distributions, so the sampling outcomes differ.

    Keeping a single random number generator would add a synchronization dependency, and would not guarantee identical results unless the threads are exactly synchronized.
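
The effect described above can be reproduced without MALLET. The following is a minimal sketch, not MALLET's actual sampler: `SeedPartitionDemo`, `drawTopic`, and `sample` are hypothetical names, the per-token weights are invented for illustration, and the "threads" are simulated sequentially since the segments share no state. It assumes only what the answer states: one generator per segment, all initialized with the same seed.

```java
import java.util.Arrays;
import java.util.Random;

public class SeedPartitionDemo {

    // Hypothetical stand-in for a per-token sampling step: token i draws one of
    // four "topics" from weights that depend on i, so each token has its own
    // sampling distribution.
    static int drawTopic(int tokenIndex, Random rng) {
        double[] weights = new double[4];
        double total = 0.0;
        for (int k = 0; k < weights.length; k++) {
            weights[k] = 1 + ((tokenIndex * 7 + k * 3) % 5);
            total += weights[k];
        }
        // Inverse-CDF draw from the unnormalized weights.
        double u = rng.nextDouble() * total;
        for (int k = 0; k < weights.length; k++) {
            u -= weights[k];
            if (u < 0) {
                return k;
            }
        }
        return weights.length - 1;
    }

    // Splits the tokens into contiguous segments, one per simulated thread.
    // Every segment gets its own generator, all initialized with the same seed.
    static int[] sample(int numTokens, int numThreads, long seed) {
        int[] topics = new int[numTokens];
        int segment = (numTokens + numThreads - 1) / numThreads; // ceiling division
        for (int t = 0; t < numThreads; t++) {
            Random rng = new Random(seed);
            int end = Math.min(numTokens, (t + 1) * segment);
            for (int i = t * segment; i < end; i++) {
                topics[i] = drawTopic(i, rng);
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        int[] oneThread = sample(1000, 1, 42L);
        int[] fourThreads = sample(1000, 4, 42L);
        int[] fourThreadsAgain = sample(1000, 4, 42L);
        System.out.println("4 vs 4 threads identical: "
                + Arrays.equals(fourThreads, fourThreadsAgain));
        System.out.println("1 vs 4 threads identical: "
                + Arrays.equals(oneThread, fourThreads));
    }
}
```

In this sketch, two runs with the same seed and the same thread count agree, while runs with different thread counts agree only on the first segment, where the same random numbers happen to meet the same tokens. If results must be reproducible across machines, the practical consequence is to fix the thread count explicitly rather than derive it from the hardware.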