Search code examples
apache-flinkavroflink-streaming

How does Flink handle serialization of managed state?


What format does Flink persists managed state of an operator (either for checkpointing or for communication between logical operators (i.e. along edges of the job graph)?

The documentation reads

Standard types such as int, long, String etc. are handled by serializers we ship with Flink. For all other types, we fall back to Kryo.

What are those serializers that ship with Flink?

Background: I am considering switching from JSON to using AVRO for both ingesting data into my Sources, and also emitting data to my Sinks. However, the auto-generated POJO classes created by Avro are rather noisy. So within the Job graph (for communication between Flink operators) I am contemplating whether there is any performance benefit to using a binary serialization format like Avro. It may be that there is no material performance impact (since Flink potentially uses an optimized format as well), and it just has to do more with types compatibility. But I just wanted to get more information on it.


Solution

  • Flink uses its own, built-in serialization framework for basic types, POJOs, and case classes, and it is designed to be efficient. Avro does have advantages in the area of schema evolution, which is relevant when considering Flink's savepoints. On that topic, see this message on the user mailing list.