I am attempting to parse the output of Apache Tika Server's rmeta
web service endpoint: https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-RecursiveMetadataandContent
Its payloads look like the following:
[
  {"Application-Name":"Microsoft Office Word",
   "Application-Version":"15.0000",
   "X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.ooxml.OOXMLParser"],
   "X-TIKA:content":"this content string can be many MB large",
   ...
  },
  {"Content-Encoding":"ISO-8859-1",
   "Content-Length":"8",
   "Content-Type":"text/plain; charset=ISO-8859-1",
   "X-TIKA:content":"again, this content string can be many MB large",
   ...
  }
  ...
]
As indicated, the X-TIKA:content strings can be very large, enough to OOM my JVM if I load an entire string into memory.
So if I were to use JsonParser.getText() like this:
private void parseRmetaResponse(CloseableHttpResponse response) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();
    JsonFactory jsonFactory = objectMapper.getFactory();
    JsonParser jsonParser = jsonFactory.createParser(response.getEntity().getContent());
    JsonToken arrayStartToken = jsonParser.nextToken();
    if (arrayStartToken != JsonToken.START_ARRAY) {
        throw new IllegalStateException("The first element of the Json structure was expected to be a start array token, but it was: " + arrayStartToken);
    }
    JsonToken nextToken = jsonParser.nextToken();
    // Stop at the end of the array, or at end of input (null token).
    while (nextToken != null && nextToken != JsonToken.END_ARRAY) {
        parseNextField(jsonParser);
        // Advance the parser; without this the loop never terminates.
        nextToken = jsonParser.nextToken();
    }
}
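private String getTextContents(JsonParser jsonParser, OutputStream os, Metadata metadata) throws IOException {
    String nextAttr = jsonParser.nextFieldName();
    if ("X-TIKA:content".equals(nextAttr)) {
        // Move from the field name to its value; otherwise getText() returns the field name.
        jsonParser.nextToken();
        // Materializes the entire value as a single String, which is the problem.
        return jsonParser.getText();
    }
    // ...
}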
This would be prone to OOM crashes, because I cannot load the whole string into memory without exhausting the JVM heap.
Instead, I have a maxChars parameter giving the maximum number of characters to read, and I want to stop reading characters from X-TIKA:content once that limit is reached.
How can I say "get me the text, but only read up to maxChars characters, and discard any additional characters"?
I can use Gson, FasterXML Jackson, or any other library that helps me do what I need to do here.
Instead of calling String getText(), you can call int getText(Writer writer).
Give it a custom Writer that works similarly to StringWriter, but discards any characters beyond a given threshold.
Then you would use it like this:
if ("X-TIKA:content".equals(nextAttr)) {
    // Move from the field name to its string value before reading the text.
    jsonParser.nextToken();
    try (LimitedStringWriter writer = new LimitedStringWriter(maxChars)) {
        jsonParser.getText(writer);
        return writer.toString();
    }
}
Writing the LimitedStringWriter class is left to you.
Added by questioner (Nicholas DiPiazza):
Here is an implementation you could use as an example: https://github.com/ow2-proactive/scheduling/blob/master/common/common-api/src/main/java/org/ow2/proactive/utils/BoundedStringWriter.java
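For reference, a bounded writer along these lines could look roughly like the sketch below. It is not the linked BoundedStringWriter and not part of Jackson; the class name LimitedStringWriter comes from the answer, while the isTruncated() helper and the choice to drop overflow silently (rather than throw) are assumptions made for illustration. It uses only java.io.Writer and StringBuilder.

```java
import java.io.Writer;

/**
 * Hypothetical bounded writer: keeps at most maxChars characters and
 * silently discards everything written after the limit is reached.
 * Note: the cut is made on char boundaries, so a surrogate pair that
 * straddles the limit would be split.
 */
class LimitedStringWriter extends Writer {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxChars;
    private boolean truncated;

    LimitedStringWriter(int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        this.maxChars = maxChars;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        // Copy only as many chars as still fit under the limit.
        int toCopy = Math.min(len, maxChars - buffer.length());
        if (toCopy > 0) {
            buffer.append(cbuf, off, toCopy);
        }
        if (toCopy < len) {
            truncated = true;
        }
    }

    @Override
    public void write(String str, int off, int len) {
        // Overridden so large strings are not first copied into a temporary char[].
        int toCopy = Math.min(len, maxChars - buffer.length());
        if (toCopy > 0) {
            buffer.append(str, off, off + toCopy);
        }
        if (toCopy < len) {
            truncated = true;
        }
    }

    /** True if at least one character was discarded (assumed helper, for callers that care). */
    boolean isTruncated() {
        return truncated;
    }

    @Override
    public void flush() {
        // Nothing to flush: everything lives in the in-memory buffer.
    }

    @Override
    public void close() {
        // No underlying resource; declared without IOException so try-with-resources stays simple.
    }

    @Override
    public String toString() {
        return buffer.toString();
    }
}
```

Passing an instance to jsonParser.getText(writer) then retains at most maxChars characters of the value. This only bounds what the writer keeps: depending on the Jackson version and parser implementation, the parser may still buffer the token internally before handing it to the writer, so it is worth measuring heap usage with a realistically large payload.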