protocol-buffers thrift-protocol

Can field tags be dropped from protobuf/thrift messages?


I understand that protobuf/thrift need unique numeric field tags to provide version compatibility. They achieve this by serializing messages (roughly) in this fashion:

<tag1> <value1> ... <tagN> <valueN>

When deserializing, they read each tag value, look it up in the message schema, and so know which field the value belongs to. As long as new fields are added with new tag values, old and new messages remain compatible.
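As a rough illustration of the tagged layout described above, here is a minimal Python sketch. It is not the real protobuf or Thrift wire format: the one-byte tag, the fixed 32-bit integer values, and the names `SCHEMA_V1`, `SCHEMA_V2`, `encode`, and `decode` are all assumptions made for the example.

```python
import struct

# Hypothetical schemas mapping tag number -> field name.
# Every value is a signed 32-bit integer to keep the sketch short.
SCHEMA_V1 = {1: "id", 2: "age"}
SCHEMA_V2 = {1: "id", 2: "age", 3: "score"}  # adds a field under a new tag


def encode(message, schema):
    """Serialize a dict as a sequence of <tag> <value> pairs."""
    name_to_tag = {name: tag for tag, name in schema.items()}
    out = bytearray()
    for name, value in message.items():
        # One unsigned byte for the tag, four bytes for the value.
        out += struct.pack("<Bi", name_to_tag[name], value)
    return bytes(out)


def decode(data, schema):
    """Read <tag> <value> pairs, skipping tags this schema does not know."""
    message = {}
    for tag, value in struct.iter_unpack("<Bi", data):
        if tag in schema:
            message[schema[tag]] = value
    return message


# A reader with the old schema can still decode a message from a newer writer.
new_bytes = encode({"id": 7, "age": 30, "score": 99}, SCHEMA_V2)
old_view = decode(new_bytes, SCHEMA_V1)
```

In this sketch the old reader simply ignores tag 3, which is the compatibility property the tags buy. The cost is the extra tag byte in front of every value.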

But I don't think this is a very good design:

  1. The tag value has to be encoded within the message. This has some overhead.

    For example, when a client invokes an RPC method on a remote server many times, the tag values in every request/response are the same. It would be nice to send <tag1> <value1> ... <tagN> <valueN> only once, and after that send just <value1> ... <valueN>.

  2. When changing the type of a field, we also need to change the tag value. Forgetting to do this will lead to bugs.

  3. Developers have to ensure tag values are unique. Usually people track the last used tag id and increment it when adding new fields. But when two people add fields in separate branches and then merge, both may have picked the same tag, and the conflict is hard to resolve.

I think a better design could be:

Create a compact schema for each message type, like this:

<field_name_1> <field_type_1> ... <field_name_N> <field_type_N> (sorted according to field_name)

To address issue 1, exchange the message schema before doing anything else. In the RPC example, the client sends its message schema before sending the first RPC; in subsequent RPCs it sends only <value_1> ... <value_N>. The server already has the message schema when a request arrives, so it knows how to deserialize it.
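A minimal sketch of this "schema once, values afterwards" idea in Python, under stated assumptions: the schema is sent as JSON, only a hypothetical "int32" field type is supported, and the helper names (`compact_schema`, `encode_schema`, `encode_values`, `decode_values`) are invented for illustration.

```python
import json
import struct


def compact_schema(fields):
    """Return [(name, type), ...] sorted by field name."""
    return sorted(fields.items())


def encode_schema(fields):
    """Schema bytes, sent once before the first RPC."""
    return json.dumps(compact_schema(fields)).encode("utf-8")


def encode_values(message, fields):
    """Values only, in schema order, with no tags."""
    out = bytearray()
    for name, type_name in compact_schema(fields):
        if type_name != "int32":
            raise ValueError("this sketch only supports int32 fields")
        out += struct.pack("<i", message[name])
    return bytes(out)


def decode_values(data, schema_bytes):
    """Rebuild the message using the schema received earlier."""
    schema = json.loads(schema_bytes.decode("utf-8"))
    values = struct.unpack("<%di" % len(schema), data)
    return {name: value for (name, _type), value in zip(schema, values)}


fields = {"id": "int32", "age": "int32"}
schema_bytes = encode_schema(fields)                   # sent once
payload = encode_values({"id": 7, "age": 30}, fields)  # sent per request
```

Each payload here carries only the values, but the receiver can decode it only if it still holds the exact schema bytes that the sender used.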

To address issue 2, when a field's type is changed, the compact message schema changes too. Programs can then detect that the old and new schemas do not match and report an error.
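One way such a mismatch check could look, sketched in Python. Comparing a SHA-256 fingerprint of the sorted schema is an assumption of this example (the schemas could equally be compared directly), and `schema_fingerprint` and `check_compatible` are hypothetical names.

```python
import hashlib


def schema_fingerprint(fields):
    """Hash the compact schema: 'name:type' entries sorted by field name."""
    canonical = ";".join(
        f"{name}:{type_name}" for name, type_name in sorted(fields.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_compatible(local_fields, remote_fingerprint):
    """Raise if the peer's schema differs from ours in any name or type."""
    if schema_fingerprint(local_fields) != remote_fingerprint:
        raise ValueError("schema mismatch between sender and receiver")


old = {"id": "int32", "name": "string"}
new = {"id": "int64", "name": "string"}  # the type of 'id' was changed

try:
    check_compatible(new, schema_fingerprint(old))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Note that a strict comparison like this one also reports a mismatch when a field is merely added, so on its own it detects incompatibility rather than providing compatibility.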

To address issue 3, developers no longer need to take care of assigning unique tag values. They still need to take care of assigning unique field names, but this should be easier, and less likely to lead to merge conflicts.

Could this be a usable design? And what problems would it have?


Solution

  • I believe Apache Avro works like you describe, so perhaps you should try that.

    However, I would argue that the upfront schema negotiation adds a great deal of complication to the protocol, which outweighs any benefit. It may seem easy enough in the simple case, but in a large-scale system where you have proxies (that don't know what they're proxying), dedicated storage servers, messages composed from pieces received from multiple senders with different protocol versions, etc., the complexity of tracking schema versions becomes a huge burden.