Tags: serialization, protocol-buffers, avro, parquet, thrift

Choosing serialization frameworks


I was reading about the downsides of using Java serialization and the need to go for a serialization framework. There are so many frameworks, like Avro, Parquet, Thrift and Protobuf.

The question is: which framework addresses what, and what are the parameters to consider while choosing a serialization framework?

I would like to get hands-on with a practical use case and compare/choose serialization frameworks based on the requirements.

Can somebody please assist on this topic?


Solution

  • There are a lot of factors to consider. I'll go through some of the important ones.

    0) Schema First or Code First

    If you have a project that will involve different languages, code-first approaches are likely to be problematic. It's all very well having a Java class that can be serialised, but it can be a nuisance if that has to be deserialised in C.

    Generally I favour schema-first approaches, just in case.
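
    As an illustration of the difference (a minimal sketch; the Reading class and its fields are hypothetical): with code first, a class in one language is the definition; with schema first, a language-neutral schema is the definition and the classes are generated from it.

    // Code first: this Java class *is* the definition. Other languages
    // (C, Python, ...) have no direct way of reusing it.
    import java.io.Serializable;

    public class Reading implements Serializable {
        private static final long serialVersionUID = 1L;
        public int sensorId;
        public double value;
    }

    // Schema first (e.g. Google Protocol Buffers): a .proto file is the
    // definition, and protoc generates equivalent classes for Java, C++,
    // Python, and so on:
    //
    //   message Reading {
    //     int32 sensor_id = 1;
    //     double value    = 2;
    //   }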

    1) Inter-object Demarcation

    Some serialisations produce a byte stream that makes it possible to see where one object stops and another begins. Others do not.

    So, if you have a message transport / data store that will separate out batches of bytes for you, e.g. ZeroMQ or a database field, then you can use a serialisation that doesn't demarcate messages. Examples include Google Protocol Buffers. With demarcation done by the transport / store, the reader can get a batch of bytes knowing for sure that it encompasses one object, and only one object.

    If your message transport / data store doesn't demarcate between batches of bytes, e.g. a network stream or a file, then you either invent your own demarcation markers or use a serialisation that demarcates for you. Examples include ASN.1 BER and XML.
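
    For instance, the Protocol Buffers Java library can layer its own length-prefix demarcation on top of an otherwise un-demarcated stream using its delimited helpers (a minimal sketch; Reading stands in for whatever protoc-generated message type you actually have):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    public class DelimitedDemo {
        public static void main(String[] args) throws Exception {
            // writeDelimitedTo prefixes each message with its length, so several
            // messages can share one file or network stream.
            try (FileOutputStream out = new FileOutputStream("readings.bin")) {
                Reading.newBuilder().setSensorId(1).setValue(20.5).build().writeDelimitedTo(out);
                Reading.newBuilder().setSensorId(2).setValue(21.0).build().writeDelimitedTo(out);
            }

            // parseDelimitedFrom reads one length-prefixed message at a time,
            // returning null at end of stream.
            try (FileInputStream in = new FileInputStream("readings.bin")) {
                Reading r;
                while ((r = Reading.parseDelimitedFrom(in)) != null) {
                    System.out.println(r.getSensorId() + " -> " + r.getValue());
                }
            }
        }
    }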

    2) Canonical

    This is a property of a serialisation which means that the serialised data describes its own structure. In principle the reader of a canonical message doesn't have to know up front what the message structure is; it can simply work that out as it reads the bytes (even if it doesn't know the field names). This can be useful in circumstances where you're not entirely sure where the data is coming from. If the data is not canonical, the reader has to know in advance what the object structure was, otherwise the deserialisation is ambiguous.

    Examples of canonical serialisations include ASN.1 BER, ASN.1 canonical PER and XML. Ones that aren't include ASN.1 uPER and possibly Google Protocol Buffers (I may have that wrong).

    AVRO does something different - the data schema is itself part of the serialised data, so it is always possible to reconstruct the object from arbitrary data. As you can imagine the libraries for this are somewhat clunky in languages like C, but rather better in dynamic languages.
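
    To see this in action (a minimal sketch using the Avro Java library; the file name is hypothetical): an Avro data file carries the writer's schema alongside the data, so a reader that knows nothing in advance can still recover both the structure and the values.

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSelfDescribing {
        public static void main(String[] args) throws Exception {
            // No schema is supplied by the reader; it is taken from the file itself.
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(new File("readings.avro"),
                                          new GenericDatumReader<GenericRecord>())) {
                System.out.println("Writer's schema: " + reader.getSchema());
                for (GenericRecord record : reader) {
                    System.out.println(record);   // fields are also accessible by name via record.get("...")
                }
            }
        }
    }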

    3) Size and Value Constraints

    Some serialisation technologies allow the developer to set constraints on the values of fields and the sizes of arrays. The intention is that code generated from a schema file incorporating such constraints will automatically validate objects on serialisation and on deserialisation.

    This can be extremely useful - free, schema-driven content inspection done automatically. It's very easy to spot out-of-specification data.

    This is extremely useful in large, heterogeneous projects (lots of different languages in use), as the single source of truth about what's valid and what's not is the schema, and only the schema, enforced automatically by the auto-generated code. Developers can't ignore / get round the constraints, and when the constraints change everyone can't help but notice.

    Examples include ASN.1 (usually done pretty well by tool sets), XML (not often done properly by free / cheap toolsets; MS's xsd.exe purposefully ignores any such constraints) and JSON (down to object validators). Of these three, ASN.1 has by far the most elaborate constraint syntax; it's really very powerful.

    Examples that don't - Google Protocol Buffers. In this regard GPB is extremely irritating, because it doesn't have constraints at all. The only way of having value and size constraints is either to write them as comments in the .proto file and hope developers read them and pay attention, or to use some other sort of non-source-code approach. With GPB being aimed very heavily at heterogeneous systems (literally every language under the sun is supported), I consider this to be a very serious omission, because value / size validation code has to be written by hand for each language used in a project. That's a waste of time. Google could add syntactical elements to .proto and the code generators to support this without changing wire formats at all (it's all in the auto-generated code).
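
    To make the point concrete (a minimal sketch; the Reading message and its limits are hypothetical): without constraints in the schema, every language in the project ends up with hand-written checks like these, kept in sync with the .proto comments only by discipline.

    // Hand-written validation for a hypothetical protoc-generated Reading message.
    // Nothing ties these limits to the .proto file; a C or Python reader has to
    // duplicate them by hand.
    public final class ReadingValidator {
        private static final int    MAX_SENSOR_ID = 1023;   // documented only as a comment in the .proto
        private static final double MAX_VALUE     = 150.0;  // ditto

        public static void validate(Reading r) {
            if (r.getSensorId() < 0 || r.getSensorId() > MAX_SENSOR_ID) {
                throw new IllegalArgumentException("sensorId out of range: " + r.getSensorId());
            }
            if (r.getValue() < 0.0 || r.getValue() > MAX_VALUE) {
                throw new IllegalArgumentException("value out of range: " + r.getValue());
            }
        }
    }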

    4) Binary / Text

    Binary serialisations will be smaller, and probably a bit quicker to serialise / deserialise. Text serialisations are more debuggable. But it's amazing what can be done with binary serialisations. For example, one can easily add ASN.1 decoders to Wireshark (you compile them up from your .asn schema file using your ASN.1 tools), et voilà - on-the-wire decoding of programme data. The same is possible with GPB, I should think.

    ASN.1 uPER is extremely useful in bandwidth-constrained situations; it automatically uses the size / value constraints to economise on bits on the wire. For example, a field valid between 0 and 15 needs only 4 bits, and that's what uPER will use. It's no coincidence, I should think, that uPER features heavily in protocols like 3G, 4G and 5G. This "minimum bits" approach is a whole lot more elegant than compressing a text wire format (which is what's done a lot with JSON and XML to make them less bloaty).
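
    The arithmetic behind that is simple (a sketch of the idea only; real uPER encoders, per X.691, also handle length determinants, extensibility and more): a constrained integer needs just enough bits to cover the number of allowed values.

    public class UperBitBudget {
        // Minimum bits needed to distinguish (hi - lo + 1) values, i.e. roughly what a
        // uPER-style encoder spends on an INTEGER constrained to (lo..hi).
        static int bitsForRange(long lo, long hi) {
            long values = hi - lo + 1;
            return 64 - Long.numberOfLeadingZeros(values - 1);   // ceil(log2(values)); 0 if only one value
        }

        public static void main(String[] args) {
            System.out.println(bitsForRange(0, 15));     // 4 bits, as in the example above
            System.out.println(bitsForRange(16, 31));    // also 4 bits: only the offset from 16 is encoded
            System.out.println(bitsForRange(0, 1000));   // 10 bits
        }
    }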

    5) Values

    This is a bit of an oddity. In ASN.1 a schema file can define both the structure of objects and also the values of objects. With the better tools you end up with (in your C++, Java, etc. source code) classes, plus pre-defined objects of those classes already filled in with values.

    Why is that useful? Well, I use it a lot for defining project constants, and to give access to the limits on constraints. For example, suppose you had an array field with a valid length of 16 in a message. You could have a literal 16 in the field constraint, or you could cite the value of an integer value object in the constraint, with that integer also being available to developers:

    --ASN.1 value that, in good tools, is built into the 
    --generated source code
    arraySize INTEGER ::= 16
    
    --A SET that has an array of integers that size
    MyMessage ::= SET
    {
        field [0] SEQUENCE (SIZE(arraySize)) OF INTEGER
    }
    

    This is really handy in circumstances where you want to loop over that constraint, because the loop can be

    // arraySize is an integer constant in the auto-generated code built from the .asn schema file
    for (int i = 0; i < arraySize; i++) { /* do things with MyMessage.field[i] */ }
    

    Clearly this is fantastic if the constraint ever needs to be changed, because the only place it has to be changed is the schema, followed by a project recompile (where every place it's used will pick up the new value). Better still, if it's renamed in the schema file, a recompile identifies everywhere in the project it was used (because the developer-written source code that uses it is still using the old name, which is now an undefined symbol --> compiler errors).

    ASN.1 constraints can get very elaborate. Here's a tiny taste of what can be done. This is fantastic for system developers, but is pretty complicated for the tool developers to implement.

    arraySize INTEGER ::= 16
    minSize INTEGER ::= 4
    maxVal INTEGER ::= 31
    minVal INTEGER ::= 16
    oddVal INTEGER ::= 63
    
    MyMessage2 ::= SET
    {
        field_1 [0] SEQUENCE (SIZE(arraySize)) OF INTEGER,                              -- 16 elements
        field_2 [1] SEQUENCE (SIZE(0..arraySize)) OF INTEGER,                           -- 0 to 16 elements
        field_3 [2] SEQUENCE (SIZE(minSize..arraySize)) OF INTEGER,                     -- 4 to 16 elements
        field_4 [3] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER,                   -- 5 to 15 elements
        field_5 [4] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(0..maxVal),        -- 5 to 15 elements valued 0..31
        field_6 [5] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(minVal..maxVal),   -- 5 to 15 elements valued 16..31
        field_7 [6] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(minVal<..maxVal),  -- 5 to 15 elements valued 17..31
        field_8 [7] SEQUENCE (SIZE(arraySize)) OF INTEGER(minVal<..<maxVal),            -- 16 elements valued 17..30
    field_9 [8] INTEGER (minVal..maxVal | oddVal),                                  -- valued 16 to 31, and also 63
    f8_indx [10] INTEGER (0..<arraySize)                                            -- index into field_8, constrained to be within its bounds
    }
    

    So far as I know, only ASN.1 does this, and even then it's only the more expensive tools that actually pick these elements up out of a schema file. This makes it tremendously useful in a large project, because literally everything related to the data, its constraints and how to handle it is defined in the .asn schema, and nowhere else.

    As I said, I use this a lot, for the right type of project. Once one has got it pervading an entire project, the amount of time and risk saved is fantastic. It changes the dynamics of a project too; one can make late changes to a schema knowing that the entire project picks those up with nothing more than a recompile. So, protocol changes late in a project go from being high risk to something you might be content to do every day.

    6) Wireformat Object Type

    Some serialisation wireformats will identify the type of an object in the wireformat bytestream. This helps the reader in situations where objects of many different types may arrive from one or more sources. Other serialisations won't.

    ASN.1 varies from wireformat to wireformat (it has several, including a few binary ones as well as XML and JSON). ASN.1 BER uses type, length and value fields in its wireformat, so it is possible for the reader to peek at an object's tag up front and decode the byte stream accordingly. This is very useful.
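
    For instance (a deliberately simplified sketch; it assumes a single-byte tag and a short-form length, which real BER decoders must not rely on), peeking at the tag looks roughly like this:

    import java.io.DataInputStream;
    import java.io.InputStream;

    public class BerPeek {
        // Read one BER TLV: single-byte tag, short-form length (< 128 content octets).
        // Real BER also allows multi-byte tags and long-form / indefinite lengths.
        static void peek(InputStream in) throws Exception {
            DataInputStream data = new DataInputStream(in);
            int tag    = data.readUnsignedByte();   // identifies the type of the object that follows
            int length = data.readUnsignedByte();   // short form: number of content octets
            byte[] value = new byte[length];
            data.readFully(value);
            System.out.printf("tag 0x%02X, %d content octets%n", tag, length);
            // dispatch on 'tag' to the appropriate decoder here
        }
    }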

    Google Protocol Buffers doesn't quite do the same thing, but if all message types in a .proto are bundled up into one final wrapper message containing a oneof, and it's only ever that wrapper that gets serialised, then you can achieve the same thing (see the sketch below).
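
    A sketch of what that looks like on the reading side, assuming a hypothetical .proto along the lines of "message Envelope { oneof payload { Reading reading = 1; Alarm alarm = 2; } }" compiled for Java:

    import java.io.InputStream;

    public class EnvelopeDispatcher {
        // Envelope, Reading and Alarm are the classes protoc would generate
        // from the hypothetical .proto above.
        static void handle(InputStream in) throws Exception {
            Envelope env = Envelope.parseFrom(in);
            switch (env.getPayloadCase()) {               // generated accessor for the oneof
                case READING         -> process(env.getReading());
                case ALARM           -> raise(env.getAlarm());
                case PAYLOAD_NOT_SET -> System.err.println("empty envelope");
            }
        }

        static void process(Reading r) { /* handle a Reading */ }
        static void raise(Alarm a)     { /* handle an Alarm */ }
    }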

    7) Tool Cost

    ASN.1 tools range from very, very expensive (and really good), to free (and less good). A lot of others are free, though I've found that the best XML tools (paying proper attention to value / size constraints) are quite expensive too.

    8) Language Coverage

    If you've heard of it, it's likely covered by tools for lots of different languages. If not, less so.

    The good commercial ASN.1 tools cover C/C++/Java/C#. There are some free C/C++ ones of varying completeness.

    9) Quality

    It's no good picking up a serialisation technology if the quality of the tools is poor.

    In my experience, GPB is good (it generally does what it says it will). The commercial ASN.1 tools are very good, eclipsing GPB's toolset comprehensively. AVRO works. I've heard of occasional problems with Cap'n Proto, but having not used it myself you'd have to check that out. XML works with good tools.

    10) Summary

    In case you can't tell, I'm quite a fan of ASN.1.

    GPB is incredibly useful too for its widespread support and familiarity, but I do wish Google would add value / size constraints to fields and arrays, and also incorporate a value notation. If they did this it'd be possible to have the same project workflow as can be achieved with ASN.1. If Google added just these two features, I'd consider GPB to be pretty well nigh on "complete", needing only an equivalent of ASN.1's uPER to finish it off for those people with little storage space or bandwidth.

    Note that quite a lot of this comes down to what a project's circumstances are, as well as how good / fast / mature the technology actually is.