I am trying to use the CoreNLP project in a MapReduce program to find the sentiment of a large number of texts stored in HBase tables. I am using the SR parser for parsing. The model file is stored in HDFS at /user/root/englishSR.ser.gz. I have added the line below to the MapReduce application code:
job.addCacheFile(new URI("/user/root/englishSR.ser.gz#model"));
Then, in the mapper, I set:
props.setProperty("parse.model", "./model");
I am getting edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header.
The pom.xml file contains:
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
    <classifier>models</classifier>
</dependency>
I have also tried adding the file to resources and adding to the maven, with all attempts resulting in GC overhead limit exceeded or Java heap issues.
I don't know Hadoop well, but I suspect that you're confusing CoreNLP about the compression of the SR parser model.
First try this without using Hadoop:
java -mx4g edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser -serializedPath /user/root/englishSR.ser.gz
See if that loads the parser correctly. If it does, it should print something like the line below and exit; otherwise, it will throw an exception.
Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... done [10.4 sec].
If the parser loads, then there is nothing wrong with the model file. I think the problem is then that CoreNLP simply checks whether a file or resource name ends in ".gz" to decide whether it is gzipped, and so it wrongly interprets the line:
props.setProperty("parse.model", "./model");
as a request to load a model that is not gzipped. So I would hope that one or the other of the options below would work:
cd /user/root ; gunzip englishSR.ser.gz
job.addCacheFile(new URI("/user/root/englishSR.ser#model"));
props.setProperty("parse.model", "./model");
Or, leaving the model gzipped and giving the cached name a ".gz" suffix:
job.addCacheFile(new URI("/user/root/englishSR.ser.gz#model.gz"));
props.setProperty("parse.model", "./model.gz");
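To illustrate why the name matters, here is a small sketch using only the JDK (no CoreNLP or Hadoop). The class and method names are made up for the example, and this is not CoreNLP's actual loading code; it only reproduces the same kind of failure. Gzipped, Java-serialized data reads back fine when it is gunzipped first, but handing the raw gzip bytes straight to an ObjectInputStream fails with the "invalid stream header" message from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.StreamCorruptedException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
public class GzipHeaderCheck {

    /** True if the data starts with the gzip magic bytes 0x1f 0x8b. */
    public static boolean hasGzipMagic(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    /** Serializes a string with Java serialization, then gzips the result. */
    public static byte[] gzippedSerialized(String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out =
                         new ObjectOutputStream(new GZIPOutputStream(bytes))) {
                out.writeObject(value);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Reads one serialized string, gunzipping first only if treatAsGzip is set.
     * A corrupt stream header is reported as a string instead of being thrown.
     */
    public static String read(byte[] data, boolean treatAsGzip) {
        try {
            InputStream in = new ByteArrayInputStream(data);
            if (treatAsGzip) {
                in = new GZIPInputStream(in);
            }
            try (ObjectInputStream objects = new ObjectInputStream(in)) {
                return (String) objects.readObject();
            }
        } catch (StreamCorruptedException e) {
            return "StreamCorruptedException: " + e.getMessage();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] model = gzippedSerialized("stand-in for a parser model");
        System.out.println("gzip magic present: " + hasGzipMagic(model));
        System.out.println("read as gzip:       " + read(model, true));
        System.out.println("read as plain:      " + read(model, false));
    }
}
```

If the same magic-byte check were run on the localized ./model file inside the mapper, it would show whether the cached file is still gzipped, which tells you which of the two options above applies.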