Understanding Serialization Performance in Apache Flink


For the sake of cleaner code, I currently work with immutable Java objects in my pipeline instead of POJOs. However, reading the Flink blog post "Flink Serialization Tuning Vol. 1: Choosing your Serializer — if you can", I learned that out-of-the-box POJO serialization performs better than explicit serialization with corresponding Kryo Serializer<T> classes. Unfortunately, I fail to understand why. POJO serialization involves rather costly reflection, whereas I would expect that explicitly defining how to map a custom type to a byte stream (or array or buffer) and back would be faster. What is the reason for this, or am I simply overlooking something? Given the superior performance, I am considering converting my types to POJOs, although I do not really like the idea.

Moreover, other stream processing engines such as Kafka Streams, Hazelcast or Beam seem to require specifying serializers for non-standard types. Is Flink really different in this regard and, if so, why?


Solution

  • Yes, Flink is really different in that it tries to reduce the need for custom serializers by analyzing each type and picking a suitable serializer for it, and (worst case) falling back to the very generic Kryo serialization framework.

    Kryo's ability to handle almost anything you throw at it is why it's so much slower than POJO serialization.

    You can register custom serializers (via Kryo) for your types, and that should win back most of the performance you lose by using immutable classes.
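
To make the last point a bit more concrete, here is a minimal sketch of the kind of hand-written mapping a custom serializer for an immutable class would contain. It uses only the JDK (DataOutputStream/DataInputStream) so that it runs on its own; it is not Flink's or Kryo's actual code. The SensorReading type and the write/read helpers are hypothetical names made up for this illustration. The point is that the read side can rebuild the object through its constructor, so the class needs neither setters nor a no-argument constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ImmutableSerializerSketch {

    // Hypothetical immutable event type: final fields, no setters and no
    // no-argument constructor, so it does not follow the usual POJO shape.
    static final class SensorReading {
        private final String sensorId;
        private final long timestamp;
        private final double value;

        SensorReading(String sensorId, long timestamp, double value) {
            this.sensorId = sensorId;
            this.timestamp = timestamp;
            this.value = value;
        }

        String sensorId() { return sensorId; }
        long timestamp() { return timestamp; }
        double value() { return value; }
    }

    // Writes the fields in a fixed order. No type information is written,
    // because the reader already knows which type it is decoding.
    static byte[] write(SensorReading reading) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(reading.sensorId());
            out.writeLong(reading.timestamp());
            out.writeDouble(reading.value());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields in the same order and rebuilds the object through
    // its constructor, which is what keeps the class immutable.
    static SensorReading read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String sensorId = in.readUTF();
            long timestamp = in.readLong();
            double value = in.readDouble();
            return new SensorReading(sensorId, timestamp, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SensorReading original = new SensorReading("sensor-1", 1_700_000_000_000L, 21.5);
        byte[] encoded = write(original);
        SensorReading copy = read(encoded);
        System.out.println(encoded.length + " bytes, sensorId=" + copy.sensorId()
                + ", timestamp=" + copy.timestamp() + ", value=" + copy.value());
    }
}
```

In a real job, the same field-by-field logic would presumably live in the write and read methods of a Kryo Serializer<SensorReading> subclass, using Kryo's Output and Input instead of the JDK streams. In Flink 1.x, such a class can be registered with a call like env.getConfig().registerTypeWithKryoSerializer(SensorReading.class, YourSerializer.class), where YourSerializer is a placeholder name; check the documentation for your Flink version, since the registration mechanism has changed between releases.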