I was just struck with an odd exception from the entrails of StanfordNLP, when trying to tokenize:
    java.lang.NullPointerException
        at edu.stanford.nlp.process.PTBLexer.zzRefill(PTBLexer.java:24511)
        at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java:24718)
        at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:276)
        at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:163)
        at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:55)
        at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.primeNext(DocumentPreprocessor.java:270)
        at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.hasNext(DocumentPreprocessor.java:334)
The code that causes it looks like this:
    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(tweet));

    // unigrams
    for (List<HasWord> sentence : dp) {
        for (HasWord word : sentence) {
            // do stuff
        }
    }

    // bigrams
    for (List<HasWord> sentence : dp) {  // << exception is thrown here
        Iterator<HasWord> it = sentence.iterator();
        String st1 = it.next().word();
        while (it.hasNext()) {
            String st2 = it.next().word();
            String bigram = st1 + " " + st2;
            // do stuff
            st1 = st2;
        }
    }
What is going on? Does this have to do with me looping over the tokens twice?
This is certainly an ugly stack trace, which can and should be improved. (I'm about to check in a fix for that.) But the reason this doesn't work is that a DocumentPreprocessor acts like a Reader: it only lets you make a single pass through the sentences of a document. So after the first for-loop, the document is exhausted and the underlying Reader has been closed. Hence the second for-loop fails, and in this case it crashes deep in the lexer. I'm going to change it so that a second pass simply gives you nothing. But to get what you want, you should either (most efficiently) collect both the unigrams and bigrams in one for-loop pass through the document, or create a second DocumentPreprocessor for the second pass.
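A minimal sketch of the single-pass approach, assuming the same tweet variable and the same "do stuff" placeholders as in your question:

    import java.io.StringReader;
    import java.util.List;

    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.process.DocumentPreprocessor;

    // One traversal of the document: unigram and bigram work are both done
    // inside the same loop, so the DocumentPreprocessor is only read once.
    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(tweet));
    for (List<HasWord> sentence : dp) {
        String prev = null;  // previous token in this sentence, for building bigrams
        for (HasWord word : sentence) {
            String current = word.word();
            // do stuff with the unigram current
            if (prev != null) {
                String bigram = prev + " " + current;
                // do stuff with the bigram
            }
            prev = current;
        }
    }

The alternative is to construct a fresh DocumentPreprocessor over a new StringReader(tweet) before the bigram loop; that re-tokenizes the text from scratch, which is why the single-pass version is the more efficient option.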