How to load an SR parser model file stored in HDFS from the mapper?


I am trying to use the CoreNLP project in a MapReduce program to find the sentiment of a large number of texts stored in HBase tables. I am using the SR (shift-reduce) parser for parsing. The model file is stored in HDFS at /user/root/englishSR.ser.gz. I have added the following line to the MapReduce application code:

 job.addCacheFile(new URI("/user/root/englishSR.ser.gz#model"));

Then, in the mapper, I set:

 props.setProperty("parse.model", "./model");

This fails with edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header. The pom.xml file contains:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
    <classifier>models</classifier>
</dependency>

I have also tried adding the model file to the project resources and pulling it in through Maven; every attempt ended in either "GC overhead limit exceeded" or Java heap space errors.


Solution

  • I don't know Hadoop well, but I suspect that CoreNLP is being confused about whether the SR parser model is compressed.

    First try loading the model without Hadoop, using a local copy of the file:

    java -mx4g edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser -serializedPath /user/root/englishSR.ser.gz
    

    Check whether that loads the parser. If it does, it should print something like the following and exit; otherwise it will throw an exception.

    Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... done [10.4 sec].
    

    If the parser loads, then there is nothing wrong with the model file. In that case I think the problem is that CoreNLP decides whether a file or resource is gzipped purely by whether its name ends in ".gz", and so it wrongly interprets the line:

    props.setProperty("parse.model", "./model");
    

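    as a request to load an uncompressed model. It then reads the gzip bytes as if they were a Java serialization stream, which produces the "invalid stream header" error. So I would expect one or the other of the following to work: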

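    # The model lives in HDFS, so decompress it through the hdfs CLI rather than a local cd/gunzip
    hdfs dfs -cat /user/root/englishSR.ser.gz | gunzip | hdfs dfs -put - /user/root/englishSR.ser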
    
    job.addCacheFile(new URI("/user/root/englishSR.ser#model"));
    
    props.setProperty("parse.model", "./model");
    

    Or keep the model gzipped and give the cache symlink a name that ends in ".gz":

    job.addCacheFile(new URI("/user/root/englishSR.ser.gz#model.gz"));
    
    props.setProperty("parse.model", "./model.gz");
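
The failure mode described above can be reproduced with the JDK alone, without CoreNLP or Hadoop. The sketch below is only an illustration under stated assumptions: `GzipSuffixDemo`, `gzipSerialized`, `loadByName`, and `looksGzipped` are hypothetical names, and `loadByName` merely imitates a loader that trusts the ".gz" suffix; it is not CoreNLP's actual loading code. It serializes and gzips a small object, then reads it back once under a name ending in ".gz" and once under a name without it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.StreamCorruptedException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSuffixDemo {

    // Hypothetical helper: serialize an object and gzip it, like a .ser.gz model file.
    static byte[] gzipSerialized(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    // Hypothetical loader that decides on decompression from the name alone
    // (an imitation of the suspected behaviour, not CoreNLP's real code).
    static Object loadByName(String name, byte[] data) throws IOException, ClassNotFoundException {
        InputStream in = new ByteArrayInputStream(data);
        if (name.endsWith(".gz")) {
            in = new GZIPInputStream(in);
        }
        try (ObjectInputStream objects = new ObjectInputStream(in)) {
            return objects.readObject();
        }
    }

    // Returns true if the data starts with the gzip magic bytes 0x1f 0x8b.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) throws Exception {
        byte[] model = gzipSerialized("pretend parser model");
        System.out.println("gzipped: " + looksGzipped(model));

        // Name ends in ".gz": the data is decompressed first and loads normally.
        System.out.println("as model.gz: " + loadByName("model.gz", model));

        // Name lacks ".gz": the gzip bytes are read as a serialization header.
        try {
            loadByName("model", model);
        } catch (StreamCorruptedException e) {
            System.out.println("as model: " + e.getMessage());
        }
    }
}
```

A check like `looksGzipped` could also be run on the first two bytes of `./model` inside the mapper, to confirm what the distributed cache actually delivered before handing the path to CoreNLP.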