Unmarshaling a proto encoded message without knowing what type it is

I would like to unmarshal a []byte into a proto.Message without knowing its type beforehand.

To add some more detail: I know the set S of types that the encoded message can be. (They are all declared in my own proto files and built into the Go binary.)

I wanted to see if it was possible to take a byte array and reconstruct the original message back from it.

I have written this demo (https://play.golang.org/p/WF9KpTlZnp7). I am able to decode the bytes into a dynamicpb message if I pass it the descriptor, and to get a message back from the Any.

Solution

The Protocol Buffers wire format is not self-describing. That means that information about the type of the protocol buffer is not encoded with the message itself. This is why during unmarshal, the type must be provided.

On the wire, the protobuf format is really simple. Essentially, each field is encoded as its field number (not the name, only the number), a wire type that tells the parser how long the value is, and the value itself (serialized to bytes).

There are two strategies for dynamically decoding messages.

One is using Any.

https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/any.proto

For a message to be decodable through the any package, it must first be encoded with the any package, and its type must be registered in the proto type registry on both the encoding and the decoding side. This is because Any is implemented by simply encoding the underlying message and putting the resulting bytes into a wrapper message alongside a string (the type URL) that references the type. This is often called an envelope because, like a letter, the original message is wrapped with additional context (the type of the message).

This strategy works best if you can control both the encoding and decoding code. This would be the recommended strategy if that’s possible.

The other strategy is to use dynamicpb and to parse the unknown fields. This works because, as of protobuf 3.5, proto3 parsers preserve unknown fields. If a message is deserialized and a field on the wire isn’t declared in the type used for decoding, that field is kept as raw bytes in the message’s unknown fields. Type information and field names aren’t carried in the encoded message, so these fields show up with only a field number and a wire type, without a name or a declared type.

If your messages differ but all share a field, with the same number and type in every message, whose value identifies the message type, you can deserialize that field first, switch on its value, and then deserialize the remaining fields into the matching type.

This is more of a workaround for a system that doesn’t control the encoding path, and it isn’t the recommended strategy.