java jackson gson fasterxml

When parsing JSON with Java, how can I make getText() read at most a maximum number of characters?


I am attempting to parse the output of Apache Tika Server's rmeta web service endpoint: https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-RecursiveMetadataandContent

Its payloads look like the following:

[
 {"Application-Name":"Microsoft Office Word",
  "Application-Version":"15.0000",
  "X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.ooxml.OOXMLParser"],
  "X-TIKA:content":"this content string can be many MB large"
  ...
 },
 {"Content-Encoding":"ISO-8859-1",
  "Content-Length":"8",
  "Content-Type":"text/plain; charset=ISO-8859-1",
  "X-TIKA:content":"again, this content string can be many MB large",
  ...
 }
 ...
]

As indicated, the X-TIKA:content strings can be very large, large enough to OOM my JVM if I load an entire string into memory.

So if I were to use JsonParser.getText() like this:

  private void parseRmetaResponse(CloseableHttpResponse response) throws IOException {
      ObjectMapper objectMapper = new ObjectMapper();
      JsonFactory jsonFactory = objectMapper.getFactory();
      JsonParser jsonParser = jsonFactory.createParser(response.getEntity().getContent());
      JsonToken arrayStartToken = jsonParser.nextToken();
      if (arrayStartToken != JsonToken.START_ARRAY) {
        throw new IllegalStateException("The first element of the Json structure was expected to be a start array token, but it was: " + arrayStartToken);
      }

      JsonToken nextToken = jsonParser.nextToken();
      // Stop at the end of the array, or at end of input (null token).
      while (nextToken != null && nextToken != JsonToken.END_ARRAY) {
        parseNextField(jsonParser);
        // Advance here; otherwise nextToken never changes and the loop never ends.
        nextToken = jsonParser.nextToken();
      }
  }

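  private String getTextContents(JsonParser jsonParser, OutputStream os, Metadata metadata) throws IOException {
    String nextAttr = jsonParser.nextFieldName();
    if ("X-TIKA:content".equals(nextAttr)) {
      // nextFieldName() leaves the parser on the field name; move to its value.
      jsonParser.nextToken();
      // getText() materializes the whole value as a single String.
      return jsonParser.getText();
    }
    // ...
  }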

It would be prone to OOM crashes because I cannot load all of that string in memory without eating up all the JVM heap.

Instead, I have a parameter maxChars, the maximum number of characters to keep: once that many characters of X-TIKA:content have been read, I want to stop reading.

How can I say "get me text, but only read up to maxChars characters, and discard any additional characters"?

I can use Gson, FasterXML Jackson, or any other library that does what I need here.


Solution

  • Instead of calling String getText(), you can call int getText(Writer writer).

    Give it a custom Writer that works similarly to StringWriter, but discards any characters beyond a given threshold.

    Then you would use it like this:

    if ("X-TIKA:content".equals(nextAttr)) {
        // Move from the field name to its string value before reading text.
        jsonParser.nextToken();
        try (LimitedStringWriter writer = new LimitedStringWriter(maxParseChars)) {
            jsonParser.getText(writer);
            return writer.toString();
        }
    }
    

    Writing the LimitedStringWriter class is left to you.


    Added by questioner (Nicholas DiPiazza):
    Here is an implementation you could use as a starting point: https://github.com/ow2-proactive/scheduling/blob/master/common/common-api/src/main/java/org/ow2/proactive/utils/BoundedStringWriter.java
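
For illustration, a bounded writer of the kind described above might look roughly like the following. This is only a minimal sketch, not the linked BoundedStringWriter: the class name LimitedStringWriter comes from the answer's snippet, while the int constructor argument and the silent-discard behaviour are assumptions made here. The limit is counted in Java chars (UTF-16 code units), so a cut can land in the middle of a surrogate pair.

```java
import java.io.Writer;

/**
 * Hypothetical bounded writer: keeps at most maxChars characters
 * and silently drops everything written after the limit is reached.
 */
final class LimitedStringWriter extends Writer {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxChars;

    LimitedStringWriter(int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        this.maxChars = maxChars;
    }

    // The other write(...) overloads inherited from Writer delegate to this one.
    @Override
    public void write(char[] cbuf, int off, int len) {
        int remaining = maxChars - buffer.length();
        if (remaining <= 0 || len <= 0) {
            return; // limit reached (or nothing to write): discard
        }
        buffer.append(cbuf, off, Math.min(len, remaining));
    }

    @Override
    public void flush() {
        // Nothing to flush; the retained text lives in memory.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Returns the retained text, at most maxChars characters long. */
    @Override
    public String toString() {
        return buffer.toString();
    }
}
```

With a limit of 5, writing "hello world" leaves toString() returning "hello". Note that this only bounds what the writer retains; whether the parser streams the value or buffers it internally before handing it to the writer depends on the Jackson version and parser implementation, so it is worth testing against a large payload.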