
Remove the most common words in MALLET


From a list of strings, I create a list of instances, each holding a feature sequence of tokens. From the command line, I can prune this data by counts, tf-idf, etc. (https://github.com/mimno/Mallet/blob/master/src/cc/mallet/classify/tui/Vectors2Vectors.java). But how can I do the same thing in Java? How do I need to extend my code?

My goal is to remove the most common words before running LDA topic modeling.

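import java.util.ArrayList;
import java.util.List;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.ArrayIterator;
import cc.mallet.types.InstanceList;

public static InstanceList createInstanceList(List<String> texts) {

    // Tokenize, lowercase, drop stopwords, then map tokens to feature indices.
    ArrayList<Pipe> pipes = new ArrayList<Pipe>();

    pipes.add(new CharSequence2TokenSequence());
    pipes.add(new TokenSequenceLowercase());
    pipes.add(new TokenSequenceRemoveStopwords());
    pipes.add(new TokenSequence2FeatureSequence());

    InstanceList instanceList = new InstanceList(new SerialPipes(pipes));

    // Run every input string through the pipeline.
    instanceList.addThruPipe(new ArrayIterator(texts));
    return instanceList;
}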

Thank you in advance for your help!


Solution

  • Look at the code you linked to for examples, starting around line 125. FeatureCountTool computes term frequency and document frequency for each word. From those counts you can either build a pruned alphabet and construct a new instance list from it, as Vectors2Vectors does, or build a new stoplist Set and reimport the documents from the source files.
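
The second option (a new stoplist Set) can be sketched without MALLET at all. The code below is only an illustration of the idea, not what FeatureCountTool does internally: the class name CommonWordPruner, the method findCommonWords, the maxDocFraction parameter, and the simple `\W+` tokenization are all assumptions made for this example. It counts in how many documents each word occurs and returns the words that appear in more than the given fraction of them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of MALLET: builds a stoplist from document frequencies.
class CommonWordPruner {

    /**
     * Returns every word that occurs in more than maxDocFraction of the documents.
     * Tokenization here is a plain split on non-word characters, which is an
     * assumption and will not match MALLET's tokenizer exactly.
     */
    static Set<String> findCommonWords(List<String> texts, double maxDocFraction) {
        Map<String, Integer> docFrequency = new HashMap<>();

        for (String text : texts) {
            // Track words already seen so each one counts once per document.
            Set<String> seenInDoc = new HashSet<>();
            for (String token : text.toLowerCase().split("\\W+")) {
                if (!token.isEmpty() && seenInDoc.add(token)) {
                    docFrequency.merge(token, 1, Integer::sum);
                }
            }
        }

        // A word is "too common" if its document count exceeds this threshold.
        double threshold = maxDocFraction * texts.size();
        Set<String> commonWords = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docFrequency.entrySet()) {
            if (entry.getValue() > threshold) {
                commonWords.add(entry.getKey());
            }
        }
        return commonWords;
    }
}
```

Calling CommonWordPruner.findCommonWords(texts, 0.5) gives the words found in more than half of the documents. To reimport, the resulting set would then be added to the stopword pipe before rebuilding the instance list. TokenSequenceRemoveStopwords appears to accept extra words through addStopWords(String[]), but check that against your MALLET version.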