I'm writing a simple serialization framework. My intention is not to compete with ProtoBuf, Thrift, Avro, etc.; far from it. My goal is to learn.
My question is related to evolving the schema of my data objects.
Let's say I have two programs, A and B, that need to exchange data and I have a data object represented by the schema below:
public byte[] accountUser = new byte[8]; // first field
Great! Now I want to go ahead and add an accountId field to my data object schema:
public byte[] accountUser = new byte[8]; // first field
public int accountId = -1; // second field just added (NEW FIELD)
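To make the wire format concrete, here is a minimal sketch of what an append-only encoder could look like, assuming a DataOutputStream-based approach (the AccountEncoder class and its encode method are purely illustrative, not part of any real framework):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative append-only encoder: fields are written strictly in
// declaration order, and any new field is appended at the end.
public class AccountEncoder {
    public static byte[] encode(byte[] accountUser, int accountId) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.write(accountUser);  // first field: fixed 8 bytes
        out.writeInt(accountId); // second field: 4 bytes, appended at the end
        return buf.toByteArray();
    }
}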
Program A has the new schema with accountId and Program B does not.
Program A sends the data object to Program B.
Program B will simply read the data up to accountUser and ignore accountId entirely. It knows nothing about that field, since it was never updated to the latest schema that includes accountId.
Everything works!
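As a sketch of what Program B's old decoder does in this case, assuming stream-based reading (the class name is hypothetical):

import java.io.DataInputStream;
import java.io.IOException;

// Old-schema decoder: reads only the fields it knows about. The trailing
// accountId bytes sent by Program A are simply never consumed.
public class OldAccountDecoder {
    public byte[] accountUser = new byte[8];

    public void decode(DataInputStream in) throws IOException {
        in.readFully(accountUser); // first (and, to this decoder, only) field
        // Anything after these 8 bytes (such as the new accountId) is ignored.
    }
}

Note this only works cleanly if records are framed somehow (one record per message, or a length prefix), so the unread trailing bytes of one record are not mistaken for the start of the next.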
Program A has the old schema without accountId and Program B has the new schema with accountId.
Program A sends the data object to Program B.
Program B will read the data up to accountUser and then try to read the new accountId. But there is nothing more to read; the record ends after accountUser. So Program B simply assumes the default value of -1 for accountId and moves on with its life. I will most probably have logic to deal with a -1 accountId coming from legacy systems still operating with the old schema.
Everything works!
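And a sketch of the corresponding new-schema decoder, assuming the record's total length is known from the transport framing (the recordLength parameter is illustrative):

import java.io.DataInputStream;
import java.io.IOException;

// New-schema decoder: reads accountUser, then reads accountId only if the
// record actually carries it; otherwise it falls back to the -1 default.
public class NewAccountDecoder {
    public byte[] accountUser = new byte[8];
    public int accountId = -1;

    public void decode(DataInputStream in, int recordLength) throws IOException {
        in.readFully(accountUser); // first field: fixed 8 bytes
        if (recordLength >= 8 + 4) {
            accountId = in.readInt(); // sender used the new schema
        } else {
            accountId = -1; // legacy sender: keep the default
        }
    }
}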
So what is really the problem with this simple approach to schema evolution? It isn't perfect, I know, but can't it be used successfully? I just have to assume that I will never remove a field and never mess with the order of the fields; I just keep adding more fields at the end.
Adding new fields isn't a problem by itself, as long as the protocol is field-based via some kind of header. Obviously, if it is size/blit based, there will be a problem, because the decoder will read the wrong amount of data per record. Adding fields is exactly how most protocols evolve, so that part is fine.

The catch is that the decoder needs to know, in advance, how to ignore a field it doesn't understand. Does it skip some fixed number of bytes? Does it look for a closing sentinel? Something else? As long as your decoder knows how to skip every possible kind of field it doesn't know about, you're fine.
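For instance, one common convention (the sketch below is hypothetical, not any particular wire format) is to prefix every field with a tag and a byte length, so the decoder can jump over anything it does not recognize:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Sketch of a tag + length field header: the decoder can skip any field
// it doesn't recognize, because the header says how many bytes to jump.
public class TaggedDecoder {
    static final int TAG_ACCOUNT_USER = 1;
    static final int TAG_ACCOUNT_ID   = 2;

    public byte[] accountUser = new byte[8];
    public int accountId = -1;

    public void decode(DataInputStream in) throws IOException {
        while (true) {
            int tag;
            try {
                tag = in.readUnsignedShort();    // field tag
            } catch (EOFException end) {
                return;                          // no more fields in this record
            }
            int length = in.readUnsignedShort(); // field payload length in bytes
            switch (tag) {
                // A real decoder would also validate length against the
                // expected size of each known field.
                case TAG_ACCOUNT_USER: in.readFully(accountUser); break;
                case TAG_ACCOUNT_ID:   accountId = in.readInt();  break;
                default:               in.skipBytes(length);      // unknown: skip
            }
        }
    }
}

With headers like this, unknown fields cost nothing to skip, and field order stops mattering, at the price of a few bytes of overhead per field.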
You also shouldn't assume simple incremental fields, IMO. I have seen real-world scenarios where a structure is branched in two different ways by different teams and then recombined, so every combination of the base schema, base+B, base+C, and base+B+C (where B and C are different sets of additional fields) is possible.
"I just have to assume that I will never remove a field and never mess with the order of the fields."
This happens. You need to deal with it, or accept that you're solving a simpler problem and that your solution is therefore simpler.