Search code examples
protocol-buffersapache-arrowdata-exchange

Comparison of protobuf and arrow


Both are language-neutral and platform-neutral data exchange libraries. I wonder what are the difference of them and which library is good for which situations.


Solution

  • They are intended for two different problems. Protobuf is designed to create a common "on the wire" or "disk" format for data.

    Arrow is designed to create a common "in memory" format for the data.

    Of course, the next question, is what does this mean?

    In Protobuf, if an application wants to work with the data, they first deserialize the data into some kind of "in memory" representation. This must be done because the Protobuf format is not easily compatible with CPU instructions. For example, protobuf packs unsigned integers into varints. These have a variable # of bytes and the wire-type of the field is crammed into the 3 least significant bits. You cannot take two unsigned integers and just add them without first converting them to some kind of "in memory" representation.

    Now, protoc does have libraries for every language to convert to an "in memory" representation for those languages. However, this "in memory" representation is not common. You cannot take a Protobuf message, deserialize it into C# (using protoc generated code) and then process on these in-memory bytes in Java without doing some kind of C#->Java marshalling of the data.

    Arrow, on the other hand, solves this problem. If you have an Arrow table in C# you can map that memory to a different language and start processing on it without doing any kind of "language-to-language" marshaling of data. This zero-copy allows for efficient hand-off between languages. Python has been employing tricks like this (e.g. the array protocol) for a while now and it works great for data analysis.

    However, Arrow is not always the greatest format for over-the-wire transmission because it can be inefficient. Those varints I mentioned before help Protobuf cut down on message size. Also, Protobuf tags each field so it can save space when there are many optional fields. In fact, Arrow uses Protobuf & gRPC for over-the-wire transmission of metadata in Arrow Flight (an RPC framework).