From a list of strings, I create an instance list whose instances are token feature sequences. On the command line, I can prune that data based on counts, tf-idf, etc. (https://github.com/mimno/Mallet/blob/master/src/cc/mallet/classify/tui/Vectors2Vectors.java). But what if I want to do the same thing in Java? How do I need to extend my code?
My goal is to remove the most common words before LDA topic modeling.
public static InstanceList createInstanceList(List<String> texts) {
    ArrayList<Pipe> pipes = new ArrayList<Pipe>();
    pipes.add(new CharSequence2TokenSequence());
    pipes.add(new TokenSequenceLowercase());
    pipes.add(new TokenSequenceRemoveStopwords());
    pipes.add(new TokenSequence2FeatureSequence());
    InstanceList instanceList = new InstanceList(new SerialPipes(pipes));
    instanceList.addThruPipe(new ArrayIterator(texts));
    return instanceList;
}
Thank you in advance for your help!
Look at the code that you linked to for examples, starting around line 125. The FeatureCountTool generates term frequency and document frequency information. You can then generate a pruned alphabet and construct a new instance list, as in Vectors2Vectors, or generate a new stoplist Set and reimport the documents from the source files.
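
To make the second option (a new stoplist Set) concrete, here is a minimal sketch in plain Java that does not touch Mallet at all. It counts document frequencies over the raw strings and collects every word that appears in more than a given fraction of the documents. The class name `FrequentWordStoplist`, the method `frequentWords`, the `maxDocFraction` parameter, and the simple letters-only tokenizer are all assumptions made for illustration; the tokenizer in particular may not split text exactly the way your Mallet pipe does.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mallet.
final class FrequentWordStoplist {

    // Returns every word that occurs in more than maxDocFraction of the texts.
    static Set<String> frequentWords(List<String> texts, double maxDocFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String text : texts) {
            // Collect the distinct words of this document so each counts once.
            Set<String> seen = new HashSet<>();
            // Assumption: lowercase, then split on anything that is not a letter.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    seen.add(token);
                }
            }
            for (String token : seen) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }

        // Keep only the words whose document frequency exceeds the threshold.
        double threshold = maxDocFraction * texts.size();
        Set<String> stoplist = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() > threshold) {
                stoplist.add(entry.getKey());
            }
        }
        return stoplist;
    }
}
```

You could then hand the resulting words to the stopword pipe before running `createInstanceList` again, for instance through `TokenSequenceRemoveStopwords.addStopWords(String[])`; please check that method against the Mallet version you are using.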