
NullPointerException with Stanford NLP Spanish POS tagging


All -

I'm running Stanford CoreNLP 3.4.1 plus the Spanish models. I have a directory of approximately 100 raw Spanish text documents, UTF-8 encoded. For each one, I run the following command line:

java -cp stanford-corenlp-3.4.1.jar:stanford-spanish-corenlp-2014-08-26-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-0.23.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props <propsfile> -file <txtfile>

The props file looks like this:

annotators = tokenize, ssplit, pos
tokenize.language = es
pos.model = edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger
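For reference, the same three-annotator setup can also be driven from Java instead of the command line. This is a minimal sketch, assuming the standard CoreNLP pipeline API; the property values mirror the props file above, and the sample sentence is just an illustration:

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class SpanishPosDemo {
    public static void main(String[] args) {
        // Same settings as the props file above
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        props.setProperty("tokenize.language", "es");
        props.setProperty("pos.model",
            "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger");

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Any short Spanish text will do; this sentence is only an example
        Annotation doc = new Annotation("El perro corre por el parque.");
        pipeline.annotate(doc);

        // Print each token with its predicted POS tag
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t"
                    + token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
            }
        }
    }
}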

For almost every file, I get the following error:

Exception in thread "main" java.lang.RuntimeException: Error annotating :
    at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1287)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1347)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1389)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1459)
Caused by: java.lang.NullPointerException
    at edu.stanford.nlp.tagger.maxent.ExtractorSpanishStrippedVerb.extract(ExtractorFramesRare.java:1626)
    at edu.stanford.nlp.tagger.maxent.Extractor.extract(Extractor.java:153)
    at edu.stanford.nlp.tagger.maxent.TestSentence.getExactHistories(TestSentence.java:465)
    at edu.stanford.nlp.tagger.maxent.TestSentence.getHistories(TestSentence.java:440)
    at edu.stanford.nlp.tagger.maxent.TestSentence.getHistories(TestSentence.java:428)
    at edu.stanford.nlp.tagger.maxent.TestSentence.getExactScores(TestSentence.java:377)
    at edu.stanford.nlp.tagger.maxent.TestSentence.getScores(TestSentence.java:372)
    at edu.stanford.nlp.tagger.maxent.TestSentence.scoresOf(TestSentence.java:713)
    at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:91)
    at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:31)
    at edu.stanford.nlp.tagger.maxent.TestSentence.runTagInference(TestSentence.java:322)
    at edu.stanford.nlp.tagger.maxent.TestSentence.testTagInference(TestSentence.java:312)
    at edu.stanford.nlp.tagger.maxent.TestSentence.tagSentence(TestSentence.java:135)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagSentence(MaxentTagger.java:998)
    at edu.stanford.nlp.pipeline.POSTaggerAnnotator.doOneSentence(POSTaggerAnnotator.java:147)
    at edu.stanford.nlp.pipeline.POSTaggerAnnotator.annotate(POSTaggerAnnotator.java:110)
    at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:847)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1275)

Any ideas? I haven't even begun to track this down. I'm fairly certain the problem is in the POS annotator; tokenize and ssplit run just fine.
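To rule the rest of the pipeline out, the tagger can be exercised on its own. A rough sketch, assuming the standard MaxentTagger API and the same model path as in the props file; the test sentence is only an illustration:

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TaggerOnly {
    public static void main(String[] args) {
        // Load the same Spanish distsim model the pipeline uses
        MaxentTagger tagger = new MaxentTagger(
            "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger");
        // tagString returns the input with a POS tag attached to each token
        System.out.println(tagger.tagString("El perro corre por el parque."));
    }
}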

P.S. Please don't say "Upgrade to 3.5.0"; I don't currently have Java 8 installed and don't want to install it yet.

Thanks in advance.


Solution

  • Yes, it seems like there's a bug in the 3.4.1 Spanish models.

    The Spanish 3.5.0 models actually seem to be compatible with Java 7. You can download the models used by 3.5.0 (stanford-spanish-corenlp-2014-10-23-models.jar) and put that jar on your classpath instead; the adjusted command is shown below. This fixed the problem for me running Java 7 locally.
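
    For example, the original command line only changes in the models jar it references. This assumes the tagger path inside the 2014-10-23 jar is unchanged, so the props file can stay as-is:

    java -cp stanford-corenlp-3.4.1.jar:stanford-spanish-corenlp-2014-10-23-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-0.23.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props <propsfile> -file <txtfile>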