I am trying to run LDA to generate topics from text files like the following one:
Document1 label1 forest=3.4 tree=5 wood=2.85 hammer=1 colour=1 leaf=1.5
Document2 label2 forest=10 tree=5 wood=2.75 hammer=1 colour=4 leaf=1
Document3 label3 forest=19 tree=0.90 wood=2 hammer=2 colour=9 leaf=4.3
Document4 label4 forest=4 tree=5 wood=10 hammer=1 colour=6 leaf=3
Each numeric value in the file is the number of occurrences of the corresponding feature (e.g., forest, tree) multiplied by a given penalty.
To generate instances from such a file, I use the following Java code:
String lineRegex = "^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$";
String dataRegex = "[\\p{L}([0-9]*\\.[0-9]+|[0-9]+)_\\=]+";

InstanceList generateInstances(String dataPath) throws UnsupportedEncodingException, FileNotFoundException {
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new Target2Label());
    pipeList.add(new CharSequenceLowercase());
    pipeList.add(new Input2CharSequence());
    pipeList.add(new CharSequence2TokenSequence(Pattern.compile(dataRegex)));
    /*pipeList.add(new TokenSequenceRemoveStopwords(new File(stopwordListPath), "UTF-8",
            false, false, false));*/
    pipeList.add(new TokenSequenceParseFeatureString(true, true, "="));
    pipeList.add(new PrintInputAndTarget());

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    Reader fileReader = new InputStreamReader(new FileInputStream(new File(dataPath)),
            "UTF-8");
    instances.addThruPipe(new CsvIterator(fileReader, Pattern.compile(lineRegex),
            3, 2, 1));
    return instances;
}
I then add the generated instances to my model by calling model.addInstances(generatedInstances). The resulting output is shown below: the pipeline prints every instance as expected, but the call to model.addInstances(generatedInstances) then throws a NullPointerException. Debugging my code showed me that the alphabet associated with the model is null. Am I using the wrong iterator? Can anyone help me fix my code?
name: document1
target: label1
input: TokenSequence [forest=3.4 feature(forest)=3.4 span[0..10], tree=5 feature(tree)=5.0 span[11..17], wood=2.85 feature(wood)=2.85 span[18..27], hammer=1 feature(hammer)=1.0 span[28..36], colour=1 feature(colour)=1.0 span[37..45], leaf=1.5 feature(leaf)=1.5 span[46..54]]
Token#0:forest=3.4 feature(forest)=3.4 span[0..10]
Token#1:tree=5 feature(tree)=5.0 span[11..17]
Token#2:wood=2.85 feature(wood)=2.85 span[18..27]
Token#3:hammer=1 feature(hammer)=1.0 span[28..36]
Token#4:colour=1 feature(colour)=1.0 span[37..45]
Token#5:leaf=1.5 feature(leaf)=1.5 span[46..54]
name: document2
target: label2
input: TokenSequence [forest=10 feature(forest)=10.0 span[0..9], tree=5 feature(tree)=5.0 span[10..16], wood=2.75 feature(wood)=2.75 span[17..26], hammer=1 feature(hammer)=1.0 span[27..35], colour=4 feature(colour)=4.0 span[36..44], leaf=1 feature(leaf)=1.0 span[45..51]]
Token#0:forest=10 feature(forest)=10.0 span[0..9]
Token#1:tree=5 feature(tree)=5.0 span[10..16]
Token#2:wood=2.75 feature(wood)=2.75 span[17..26]
Token#3:hammer=1 feature(hammer)=1.0 span[27..35]
Token#4:colour=4 feature(colour)=4.0 span[36..44]
Token#5:leaf=1 feature(leaf)=1.0 span[45..51]
name: document3
target: label3
input: TokenSequence [forest=19 feature(forest)=19.0 span[0..9], tree=0.90 feature(tree)=0.9 span[10..19], wood=2 feature(wood)=2.0 span[20..26], hammer=2 feature(hammer)=2.0 span[27..35], colour=9 feature(colour)=9.0 span[36..44], leaf=4.3 feature(leaf)=4.3 span[45..53]]
Token#0:forest=19 feature(forest)=19.0 span[0..9]
Token#1:tree=0.90 feature(tree)=0.9 span[10..19]
Token#2:wood=2 feature(wood)=2.0 span[20..26]
Token#3:hammer=2 feature(hammer)=2.0 span[27..35]
Token#4:colour=9 feature(colour)=9.0 span[36..44]
Token#5:leaf=4.3 feature(leaf)=4.3 span[45..53]
name: document4
target: label4
input: TokenSequence [forest=4 feature(forest)=4.0 span[0..8], tree=5 feature(tree)=5.0 span[9..15], wood=10 feature(wood)=10.0 span[16..23], hammer=1 feature(hammer)=1.0 span[24..32], colour=6 feature(colour)=6.0 span[33..41], leaf=3 feature(leaf)=3.0 span[42..48]]
Token#0:forest=4 feature(forest)=4.0 span[0..8]
Token#1:tree=5 feature(tree)=5.0 span[9..15]
Token#2:wood=10 feature(wood)=10.0 span[16..23]
Token#3:hammer=1 feature(hammer)=1.0 span[24..32]
Token#4:colour=6 feature(colour)=6.0 span[33..41]
Token#5:leaf=3 feature(leaf)=3.0 span[42..48]
Coded LDA: 5 topics, 3 topic bits, 111 topic mask
Exception in thread "main" java.lang.NullPointerException
at cc.mallet.topics.ParallelTopicModel.addInstances(ParallelTopicModel.java:217)
at mallet.examples.TopicModel3.runLDA(MyTopicModel.java:106)
at mallet.examples.TopicModel3.main(MyTopicModel.java:57)
Thanks in advance.
Here are the input formats that MALLET uses: http://mallet.cs.umass.edu/import.php
Your data is close to the SVMLight format, which looks like this: "target feature:value feature:value ..."
Unfortunately, you cannot use this format for topic modelling with LDA: the topic model works on a FeatureSequence, not a FeatureVector. What you can do instead is convert your input to a bag of words. For example, if you have "Document2 label2 forest=3 tree=2 ...", convert it to "Document2 label2 forest forest forest tree tree ...".
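The conversion described above could be sketched in plain Java as follows. This is only a minimal illustration, not part of MALLET: the class name BagOfWordsConverter and the method expandLine are hypothetical. It also assumes that fractional weights such as 3.4 are rounded to the nearest whole count, which the original data does not specify; you may prefer to scale the weights before rounding so that less information is lost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a MALLET class): turns weighted features into repeated words.
class BagOfWordsConverter {

    // Expands a line of the form "name label feature=value ..." into
    // "name label feature feature ...", repeating each feature round(value) times.
    static String expandLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected 'name label feature=value ...': " + line);
        }
        List<String> out = new ArrayList<>();
        out.add(fields[0]); // document name
        out.add(fields[1]); // label
        for (int i = 2; i < fields.length; i++) {
            int eq = fields[i].lastIndexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Malformed feature: " + fields[i]);
            }
            String feature = fields[i].substring(0, eq);
            double weight = Double.parseDouble(fields[i].substring(eq + 1));
            // Assumption: round the weight to the nearest whole count, never below zero.
            long count = Math.max(0, Math.round(weight));
            for (long k = 0; k < count; k++) {
                out.add(feature);
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(expandLine("Document2 label2 forest=3 tree=2"));
    }
}
```

Once the file is rewritten this way, the third field is ordinary text, so the import pipeline would presumably tokenize it with CharSequence2TokenSequence and end with TokenSequence2FeatureSequence instead of TokenSequenceParseFeatureString, so that each instance carries a FeatureSequence.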