
Data serialization methods (Protobuf/Thrift) vs YANG from a data modeling standpoint?


Both Protobuf and Thrift provide data definition capabilities - Thrift structs and Protobuf messages. Does that make them data modeling languages in addition to being serialization formats?

Or is there something fundamentally different about how we model data with, for example, YANG compared to a Thrift or Protobuf schema?


Solution

  • Yes, I suppose they do amount to data modelling languages. The schema files one writes are a model of what the data looks like, and possibly of what the data represents.

    For instance, if a commonly used data component represented a bearing, then instead of simply using an int everywhere for a bearing one might define a specific bearing type and use that. Doing so allows you to change the primitive type used for a bearing, perhaps to a float, without having to make changes anywhere else. You can quite happily do this in Thrift, Protobuf, etc. (see the sketch below). The type being called bearing leaves no room for supposing that it might represent a time, and depending on what the tools (schema compilers) actually generate, it might end up being impossible to assign a time to a bearing in source code. That is then quite handy - developers have to try really hard to make such mistakes.
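    As a minimal sketch of that idea in Protobuf syntax (the message and field names here are invented for illustration, not taken from any real schema):

    ```proto
    // Hypothetical sketch: a dedicated Bearing type rather than a bare int32.
    // If degrees later needs to become a float, only this definition changes.
    syntax = "proto3";

    message Bearing {
      int32 degrees = 1;
    }

    message Waypoint {
      Bearing heading = 1;  // a Bearing, not just "some int" - harder to misuse
      int64 time_ms = 2;
    }
    ```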

    However, what this then reveals is a certain inadequacy in Thrift and Protobuf. That bearing type is all very well and good, but it is not constrained. It would be far better if the value of whatever primitive (int, float) could be constrained in the schema to, say, 0 to 359.

    Other schema languages do do this. The languages of JSON Schema, XML Schema (XSD), and ASN.1 all have syntax that allows the schema writer to express constraints. The end result is that these schemas can model your data better, if your data is somehow constrained.
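    As a sketch, here is how a JSON Schema (the property name is invented for illustration) can state the 0-359 range directly:

    ```json
    {
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "type": "object",
      "properties": {
        "bearing": { "type": "integer", "minimum": 0, "maximum": 359 }
      },
      "required": ["bearing"]
    }
    ```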

    ASN.1 constraints in particular can be very elaborate, allowing logical combinations of many different kinds of constraint (single values, value ranges, ranges dependent on constants set elsewhere in the schema, regular expressions, sets of ranges, and loads, loads more). Tool support for this complexity varies; the expensive paid-for ones are well worth it.
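    A rough ASN.1 sketch of the same bearing idea (module, type, and field names invented for illustration):

    ```asn1
    Navigation DEFINITIONS AUTOMATIC TAGS ::= BEGIN

        -- Simple value-range constraint: values outside 0..359 are invalid
        Bearing ::= INTEGER (0..359)

        -- Constraints can be combined, e.g. a union of two ranges
        NorthishBearing ::= INTEGER (0..45 | 315..359)

        Waypoint ::= SEQUENCE {
            heading  Bearing,
            timeMs   INTEGER (0..MAX)
        }

    END
    ```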

    The practical results in code are marked. For Thrift and Protobuf, the best you can do is put a comment in the schema to say that a data component has a limited range, and hope that the developer reads the comment and obeys it. That is more likely in a team of 1, less likely in a team of lots of developers... If the comment is ignored, it's perfectly possible for a developer to construct a message containing a bearing of 375 (outside the desired constraint), and nothing will prevent that being serialised and communicated unless a developer writes code specifically to prevent it.
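    A sketch of that failure mode in Python, assuming the hypothetical Waypoint/Bearing schema above has been compiled by protoc into a module named bearing_pb2 (the module name is an assumption):

    ```python
    # Sketch only: bearing_pb2 is the module protoc would generate from the
    # hypothetical bearing.proto above (protoc --python_out=. bearing.proto).
    import bearing_pb2

    wp = bearing_pb2.Waypoint()
    wp.heading.degrees = 375          # out of range, but accepted without complaint
    payload = wp.SerializeToString()  # serialises happily; "0..359" lived only in a comment

    # The only defence is hand-written validation that every developer must remember to call.
    def validate_waypoint(wp: bearing_pb2.Waypoint) -> None:
        if not 0 <= wp.heading.degrees <= 359:
            raise ValueError(f"bearing out of range: {wp.heading.degrees}")

    validate_waypoint(wp)  # raises ValueError
    ```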

    Whereas for, say, ASN.1, the constraints are part of the schema's syntax, and the tools used to incorporate that schema into a project will automatically generate code to check that data being serialised or deserialised complies. Result: no dependency on developers reading the comments in the schema. The same applies to JSON Schema and XSD, though a lot of XSD tools are in my experience rubbish at fully implementing the interface expressed in an XSD schema (e.g. MS's xsd.exe ignores constraints completely). I think JSON validators are better.
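    As a sketch of the JSON side of that, using the Python jsonschema package: the 0-359 rule is part of the schema itself, so the validator rejects out-of-range data without anyone having to read a comment.

    ```python
    # Sketch using the `jsonschema` package (pip install jsonschema).
    from jsonschema import ValidationError, validate

    schema = {
        "type": "object",
        "properties": {
            "bearing": {"type": "integer", "minimum": 0, "maximum": 359}
        },
        "required": ["bearing"],
    }

    try:
        validate(instance={"bearing": 375}, schema=schema)
    except ValidationError as err:
        print("rejected:", err.message)  # e.g. "375 is greater than the maximum of 359"
    ```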

    In summary, yes, Thrift and Protobuf can be used for data modelling, but their schema languages do not necessarily allow a "complete" model of the data to be defined. Other schema languages are better, permitting value / size constraints to be defined for the data as well, with differing levels of support in tools.

    What none of them does is allow the temporal / behavioural / relational qualities of data to be expressed (e.g. that a message is sent once per second), but if one were to attempt a schema language that supported that, I suspect it would become Turing complete. In which case we already have plenty of existing programming languages for the job.