Tags: json, serialization, bson, json-serialization, cbor

Binary JSON format that supports traversal


Does anyone know of a serialisation format that:

  1. Is binary and at least relatively compact,
  2. Can store JSON-style data (not Protobuf, Thrift, etc.),
  3. Supports traversal (i.e. you don't need to parse the entire document to read one part of it), and
  4. Supports large files (e.g. 30 GB)?

I have looked at the following:

  • CBOR - doesn't support traversal
  • MessagePack - doesn't support traversal
  • UBJSON - doesn't support traversal
  • Smile - doesn't support traversal

  • BSON - does support traversal! But the maximum document size is 2 GB.

BSON was so close but the maximum file size kills it for me. Are there any formats that would work? Obviously I can write my own, but there are sooooo many binary JSON formats, surely someone has made a decent one?

Edit: By "traversal" I mean the same thing that the BSON authors mean - you should be able to find a given object without having to parse the entire file. Amazon calls this "sparse" or "shallow" reading.
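
To make "traversal" concrete, here is a minimal sketch assuming a toy length-prefixed record layout. The layout and the names `write_record` and `find_payload` are made up for illustration; this is not BSON or any other real format. Because each record states its payload length up front, a reader can seek past payloads it doesn't care about instead of parsing them.

```python
import io
import struct

# Hypothetical toy layout per record:
#   key length (u16), payload length (u64), key bytes, payload bytes
HEADER = "<HQ"
HEADER_SIZE = struct.calcsize(HEADER)

def write_record(out, key: str, payload: bytes) -> None:
    """Append one length-prefixed record to a binary stream."""
    key_bytes = key.encode("utf-8")
    out.write(struct.pack(HEADER, len(key_bytes), len(payload)))
    out.write(key_bytes)
    out.write(payload)

def find_payload(stream, wanted: str):
    """Return the payload stored under `wanted`, or None if it is absent.

    Payloads of non-matching records are skipped with seek(), never read.
    """
    while True:
        header = stream.read(HEADER_SIZE)
        if len(header) < HEADER_SIZE:
            return None  # end of stream, key not found
        key_len, payload_len = struct.unpack(HEADER, header)
        key = stream.read(key_len).decode("utf-8")
        if key == wanted:
            return stream.read(payload_len)
        stream.seek(payload_len, io.SEEK_CUR)  # jump over the payload

buf = io.BytesIO()
write_record(buf, "big", b"x" * 1_000_000)
write_record(buf, "small", b'{"answer": 42}')
buf.seek(0)
print(find_payload(buf, "small"))  # prints b'{"answer": 42}'
```

The 64-bit payload length is the part that matters for very large files: a 32-bit length field is what caps a format at a few GB.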


Solution

  • Found one! Amazon Ion. From the FAQ:

    Many reads are shallow or sparse, meaning that the application is focused on only a subset of the values in the stream, and that it can quickly determine if full materialization of a value is required.

    In the spirit of these principles, the Ion specification includes features that make Ion’s binary encoding more efficient to read than other schema-free formats. These features include length-prefixing of binary values and Ion’s use of symbol tables.

    Brief notes on Ion:

    • Seems to be relatively well designed.
    • All values are TLV-encoded (type-length-value), which makes it traversable (yay!)
    • The length values aren't limited to 32 bits (yay!)
    • It has a slightly richer object model than JSON, e.g. it supports timestamps, binary data, type annotations and S-expressions (not sure why).
    • It supports a symbol table so field names can be interned! That means it is probably significantly more compact than all of the other binary JSON formats.
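
To illustrate why interning field names helps, here is a rough sketch of the general idea. It is not Ion's actual symbol table encoding; `intern_field_names` and `restore_field_names` are hypothetical helpers, and the sketch only handles flat records. Each distinct field name is stored once in a table, and every record refers to it by a small integer ID.

```python
import json

def intern_field_names(records):
    """Replace field names with integer IDs drawn from a shared table."""
    symbols = {}  # field name -> ID, assigned in first-seen order
    encoded = []
    for record in records:
        row = {}
        for name, value in record.items():
            sid = symbols.setdefault(name, len(symbols))
            row[sid] = value
        encoded.append(row)
    table = sorted(symbols, key=symbols.get)  # list index == ID
    return table, encoded

def restore_field_names(table, encoded):
    """Invert intern_field_names using the symbol table."""
    return [{table[sid]: value for sid, value in row.items()} for row in encoded]

records = [
    {"customer_identifier": i, "shipping_address": "somewhere"}
    for i in range(1000)
]
table, encoded = intern_field_names(records)

# Compare serialized sizes; JSON text is only a stand-in for a binary encoding.
plain_size = len(json.dumps(records))
interned_size = len(json.dumps(table)) + len(json.dumps(encoded))
print(plain_size, interned_size)
```

The saving grows with the number of records that share the same field names, which is the common case for JSON-style data.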

    It is not very popular. Libraries are available for only a few languages and I can't even find a command line tool that uses it. Still, it seems to be the only option if you want these features!

    Edit:

    In the end we went with SQLite, which is pretty excellent. It doesn't really follow the JSON data model, but it does let you do sparse reads very easily and it is very fast. Another possibility is DuckDB, which is kind of a modern take on SQLite but less widely supported.
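
For anyone wondering what a sparse read looks like with SQLite, here is a minimal sketch using Python's built-in `sqlite3` module. The answer above doesn't describe the actual schema, so the `items` table, its columns, and the `read_item` helper are assumptions for illustration: each JSON value is stored as text under an indexed key, and a lookup touches only the matching row.

```python
import json
import sqlite3

# In-memory database for the demo; pass a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (key TEXT PRIMARY KEY, body TEXT NOT NULL)")

# Hypothetical sample data: one JSON document per row.
rows = [
    (f"item-{i}", json.dumps({"id": i, "tags": ["a", "b"]}))
    for i in range(10_000)
]
conn.executemany("INSERT INTO items (key, body) VALUES (?, ?)", rows)
conn.commit()

def read_item(conn, key):
    """Fetch and decode a single document via the primary-key index."""
    row = conn.execute(
        "SELECT body FROM items WHERE key = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else None

print(read_item(conn, "item-4242"))  # prints {'id': 4242, 'tags': ['a', 'b']}
```

The trade-off is the one mentioned above: the top level becomes rows and columns rather than a single JSON tree, so the nesting has to be mapped onto tables or kept as opaque values.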