Tags: java, python, json, parsing, caliper

Robust JSON parser in Python or Java


I'm looking for a robust JSON parser in either Python or Java. So far I've been working in Python, but since I'm using it to analyze a Java benchmark, Java is a reasonable alternative.

By "robust" I mean robust with respect to truncated and incomplete documents.

The reason is that I'm currently using Caliper for some (micro-)benchmarks. While a benchmark is still running (or if I cancel it prematurely), the output file is not a complete JSON document. Neither json nor simplejson will read these files, which are essentially truncated at some point.

(I don't like the Caliper web interface, because it is slow, does not scale to large experiment sets, and a lot of data fails to submit and is then missing from the run.)

Roughly, the documents look like this:

[
  {
    // first record, in multiple lines
  },
  {
    // second record, in multiple lines
  },
  {
    // truncated record.

Right now I'm using a nasty hack that relies on the indentation Caliper currently produces: I split the result document at },\n\ \ { into chunks, then parse these one by one until the last one fails. That is not robust against future changes to Caliper's output. I also tried raw_decode, but called on the whole document it still expects a complete document and does not return a meaningful result at each },.

I'm looking for an API similar to e.g. XML pull, which would allow me to access the document up to the point where it was truncated, in an event-based API. Essentially, I'm interested in all complete {} sections inside the wrapper [].


Solution

  • Jackson (Java) supports event-based parsing. It also lets you stream the document while using the tree API only for the parts that interest you. A blog post linked from the original answer demonstrates this approach.
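
If you would rather stay in Python, the same idea (walk the outer [] yourself, and let the parser build a tree only for each inner {}) can be sketched with the standard library. This is a minimal sketch and not something from Caliper or Jackson: the function name iter_complete_records is made up for illustration, and it assumes the top level is a single array whose elements are objects. It calls json.JSONDecoder.raw_decode once per element, at an explicit offset, instead of once for the whole document.

```python
import json


def iter_complete_records(text):
    """Yield each complete element of a top-level JSON array.

    Stops quietly at the first element that is truncated or malformed.
    Hypothetical helper; assumes the document is one array of objects.
    """
    decoder = json.JSONDecoder()
    pos = text.find("[")
    if pos < 0:
        return  # no array start found at all
    pos += 1
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so skip it here,
        # together with the comma that separates two elements.
        while pos < end and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= end or text[pos] == "]":
            return  # input exhausted, or the array was closed properly
        try:
            # Parse exactly one value starting at pos; the second item
            # returned is the index just past that value.
            record, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            return  # truncated record: keep what was parsed so far
        yield record
```

Usage would look roughly like this (results.json is a placeholder file name):

```python
with open("results.json", encoding="utf-8") as fh:
    for record in iter_complete_records(fh.read()):
        print(record)
```

One caveat: this only detects truncation reliably when the elements are objects or arrays, because those need a closing bracket. A bare number cut off mid-way would still parse as a shorter number. It also reads the whole file into memory, which should be fine for benchmark output but is not true streaming.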