
Generating documents from LDA topic model


I'm learning a topic model from a set of documents and that's working well. But I'm wondering if any existing system will actually generate new documents from the topics and words in the model.

I.e., say I want a new document about topic 0: will Gensim, MALLET, or another tool actually produce one given my choice of topic (or topics)? Or is this a roll-your-own kind of problem?

Say I have two topics:

topic #0: 0.009*river + 0.008*lake + 0.006*island + 0.005*mountain + 0.004*area + 0.004*park + 0.004*antarctic + 0.004*south + 0.004*mountains + 0.004*dam
topic #1: 0.026*relay + 0.026*athletics + 0.025*metres + 0.023*freestyle + 0.022*hurdles + 0.020*ret + 0.017*divisão + 0.017*athletes + 0.016*bundesliga + 0.014*medals

Is there any tool that will take "topic 0: .5, topic 1: .5, length: 7" and nicely produce a document like:

island freestyle river south medals mountains area

or something along those lines? I don't want to duplicate this if it already exists.


Solution

  • Have you read the developer's guide and tutorials on the MALLET website? The developer's guide shows how to build a document that has a high probability of belonging to a given topic:

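        // In the MALLET guide, topicSortedWords comes from model.getSortedWords():
        // one sorted set per topic, highest-weighted word first.
        // dataAlphabet maps word IDs back to their strings.
        StringBuilder topicZeroText = new StringBuilder();
        Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();

        // Concatenate the five highest-weighted words of topic 0.
        int rank = 0;
        while (iterator.hasNext() && rank < 5) {
            IDSorter idCountPair = iterator.next();
            topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID())).append(" ");
            rank++;
        }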
    

    This code builds a new document that has a high probability of being assigned to topic 0 by concatenating that topic's five top-ranked words. It can easily be modified to draw on more than one topic and to produce a document of a given length.
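
One way to make that modification is to follow LDA's own generative process: for each word position, pick a topic according to the requested mixture, then pick a word according to that topic's word weights. The sketch below is plain Java and does not call MALLET. The class name `TopicDocumentSampler` and its methods are hypothetical, and the weights are only the top five entries of each topic from the question, so it approximates sampling from the full model rather than reproducing it.

```java
import java.util.Random;

class TopicDocumentSampler {

    // Draw an index with probability proportional to weights[i].
    // The weights do not need to sum to 1.
    static int sampleIndex(double[] weights, Random rng) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0.0) {
                return i;
            }
        }
        return weights.length - 1; // guard against floating-point round-off
    }

    // For each position: sample a topic from the mixture, then a word from that topic.
    // words[t] and weights[t] hold the vocabulary and word weights of topic t.
    static String[] generate(double[] topicMixture, String[][] words,
                             double[][] weights, int length, Random rng) {
        String[] document = new String[length];
        for (int i = 0; i < length; i++) {
            int topic = sampleIndex(topicMixture, rng);
            int word = sampleIndex(weights[topic], rng);
            document[i] = words[topic][word];
        }
        return document;
    }

    public static void main(String[] args) {
        // Assumption: truncated to the top five words per topic from the question.
        String[][] words = {
            {"river", "lake", "island", "mountain", "area"},
            {"relay", "athletics", "metres", "freestyle", "hurdles"}
        };
        double[][] weights = {
            {0.009, 0.008, 0.006, 0.005, 0.004},
            {0.026, 0.026, 0.025, 0.023, 0.022}
        };
        double[] topicMixture = {0.5, 0.5}; // "topic 0: .5, topic 1: .5"

        String[] document = generate(topicMixture, words, weights, 7, new Random());
        System.out.println(String.join(" ", document));
    }
}
```

With a real model you would fill `words` and `weights` from the trained topics (in MALLET, for example, by iterating over each topic's sorted words as in the snippet above) instead of hard-coding them. The output is a bag of words in random order, which is all LDA models; it will not read like a sentence.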