java json hadoop jackson recordreader

Hadoop + Jackson parsing: ObjectMapper reads Object and then breaks


I am implementing a JSON RecordReader in Hadoop with Jackson. For now I am testing locally with JUnit + MRUnit. Each JSON file contains a single object which, after some header fields, has a field whose value is an array of entries. I want each entry to be treated as a record, so I need to skip the headers.

I am able to do this by advancing the FSDataInputStream up to the point of reading. In my local testing, I do the following:

fs = FileSystem.get(new Configuration());
in = fs.open(new Path(filename));
long offset = getOffset(in, "HEADER_START_HERE");       
in.seek(offset);

where getOffset is a function that finds the position in the InputStream where the field value starts. This works correctly, judging by the value of in.getPos().
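The question does not show getOffset itself. A minimal sketch of what such a helper could look like follows, assuming the marker is plain text and that the offset of the marker's first byte is what is wanted. The class name MarkerOffsetSketch and the in-memory overload offsetIn are made up for illustration, and the sketch works on a plain java.io.InputStream rather than on FSDataInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkerOffsetSketch {

    /**
     * Scans the stream byte by byte and returns the offset of the first
     * occurrence of marker, or -1 if the stream ends without a match.
     * Hypothetical stand-in for the getOffset helper in the question.
     */
    static long getOffset(InputStream in, String marker) throws IOException {
        byte[] pattern = marker.getBytes(StandardCharsets.UTF_8);
        if (pattern.length == 0) {
            return 0;
        }
        // Sliding window holding the last pattern.length bytes read.
        byte[] window = new byte[pattern.length];
        long bytesRead = 0;
        int b;
        while ((b = in.read()) != -1) {
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = (byte) b;
            bytesRead++;
            if (bytesRead >= pattern.length && Arrays.equals(window, pattern)) {
                return bytesRead - pattern.length;
            }
        }
        return -1;
    }

    /** Convenience overload for in-memory data (illustration only). */
    static long offsetIn(byte[] data, String marker) {
        try {
            return getOffset(new ByteArrayInputStream(data), marker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] json = "{\"h\":1,\"entries\":[{\"a\":1}]}".getBytes(StandardCharsets.UTF_8);
        System.out.println(offsetIn(json, "\"entries\"")); // prints 7
    }
}
```

Because the scan consumes the stream, the caller still has to seek to the returned offset afterwards, as the in.seek(offset) call above does.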

I am reading the first record by:

ObjectMapper mapper = new ObjectMapper();
JsonNode actualObj = mapper.readValue (in, JsonNode.class);

The first record comes back fine: mapper.writeValueAsString(actualObj) shows that it was read correctly and is valid.

So far, so good.

Next I try to iterate over the objects:

ObjectMapper mapper = new ObjectMapper();
JsonNode actualObj = null;
do {
    actualObj = mapper.readValue (in, JsonNode.class);
    if( actualObj != null) {
        LOG.info("ELEMENT:\n" + mapper.writeValueAsString(actualObj) );
    }
} while (actualObj != null) ;

And it reads the first one, but then it breaks:

java.lang.NullPointerException: null
    at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:54)
    at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:57)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:243)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:273)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:225)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:193)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.codehaus.jackson.impl.ByteSourceBootstrapper.ensureLoaded(ByteSourceBootstrapper.java:340)
    at org.codehaus.jackson.impl.ByteSourceBootstrapper.detectEncoding(ByteSourceBootstrapper.java:116)
    at org.codehaus.jackson.impl.ByteSourceBootstrapper.constructParser(ByteSourceBootstrapper.java:197)
    at org.codehaus.jackson.JsonFactory._createJsonParser(JsonFactory.java:503)
    at org.codehaus.jackson.JsonFactory.createJsonParser(JsonFactory.java:365)
    at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1158)

Why is this exception happening?

Does it have to do with reading from the local file system?

Is some kind of reset needed when reusing an ObjectMapper or its underlying stream?


Solution

  • I managed to work around it. In case it helps:

    First of all, I'm using the latest Jackson 1.x version. It seems that once a JsonParser is instantiated with an InputStream, it takes control of it. So when you use readValue(), the stream is closed automatically once the value has been read (internally it calls _readMapAndClose()). There is a setting that tells the JsonParser not to close the underlying stream. You can set it on your JsonFactory before you create your JsonParser:

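    JsonFactory f = new MappingJsonFactory();
    f.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
    // Not in the original answer: the mapper has to be built on this factory for the setting to apply.
    ObjectMapper mapper = new ObjectMapper(f);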
    

    Be aware that you are then responsible for closing the stream yourself (an FSDataInputStream in my case). So, to answer the questions:

    • Why is this exception happening?

    Because the parser manages the stream, and closes it after readValue().

    • Does it have to do with reading from the local file system?

    No.

    • Is some kind of reset needed when reusing an ObjectMapper or its underlying stream?

    No. What you need to be aware of when mixing the Streaming API with ObjectMapper-like methods is that the mapper/parser may sometimes take control of the underlying stream. Refer to the Javadoc of JsonParser and check the documentation of each reading method to find the one that meets your needs.
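A different way to keep ownership of the stream, not taken from the answer above, is to hand the reading code a wrapper whose close() does nothing. The sketch below uses only the JDK. NonClosingInputStream, TrackingStream and readOneAndClose are hypothetical names, and readOneAndClose merely stands in for any library call that closes its source when it is done.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class CloseShieldSketch {

    /** Forwards reads to the wrapped stream but ignores close(). */
    static final class NonClosingInputStream extends FilterInputStream {
        NonClosingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the owner closes the real stream later.
        }
    }

    /** In-memory stream that records whether close() reached it. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Stand-in for a library call that reads a value and then closes its source. */
    static int readOneAndClose(InputStream source) {
        try (InputStream s = source) {
            return s.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream raw = new TrackingStream(new byte[] {1, 2, 3});
        int first = readOneAndClose(new NonClosingInputStream(raw));
        int second = readOneAndClose(new NonClosingInputStream(raw));
        // The wrapper absorbed both close() calls, so raw is still open.
        System.out.println(first + " " + second + " closed=" + raw.closed); // prints "1 2 closed=false"
    }
}
```

A wrapper like this only prevents the close. A parser that buffers ahead may still consume more bytes from the stream than the value it returns, so reading successive values through a single parser instance is usually the safer design.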