apache-kafka · apache-flink · apache-kafka-streams · avro

Kafka Streams vs Flink


I wrote an application that reads 100,000 Avro records per second from a Kafka topic, aggregates them by key, applies tumbling windows of 5 different sizes, calculates the highest, lowest, initial, and final value per window, and writes the results back to another Kafka topic.
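The topology is roughly the sketch below (simplified: the topic names, the `PriceStats` class, and the single window size are illustrative stand-ins, not my actual code, and serde configuration is omitted):

```java
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class OhlcTopology {

    // Per-window aggregate: initial, end, highest and lowest value.
    public static class PriceStats {
        double open, close, high = -Double.MAX_VALUE, low = Double.MAX_VALUE;
        boolean empty = true;

        PriceStats update(double value) {
            if (empty) { open = value; empty = false; }
            close = value;
            high = Math.max(high, value);
            low = Math.min(low, value);
            return this;
        }
    }

    public static void build(StreamsBuilder builder) {
        // In the real application the values are Avro records and there
        // are five windowed aggregations like this one, not just one.
        KStream<String, Double> input = builder.stream("input-topic");

        input.groupByKey() // key is unchanged, so no repartitioning here
             .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
             .aggregate(PriceStats::new, (key, value, stats) -> stats.update(value))
             .toStream()
             .map((windowedKey, stats) -> KeyValue.pair(windowedKey.key(), stats))
             .to("output-topic");
    }
}
```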

This application already exists in Flink, but there the source is RSocket in CSV format and the sink is Cassandra. The problem is that the new application uses a lot more CPU and memory. I checked this article and noticed that performance is not mentioned.

Am I correct to assume the difference is mostly due to Avro serialisation/deserialisation, or is Flink supposed to be faster for this use case? If the difference is small, I'd prefer Kafka Streams so I don't have to manage a Flink cluster.


Solution

I don't think this question can be answered in general. Both Flink and Kafka Streams can be tuned to the workload, and small changes in parameters can make a large difference in performance. Generally, there is no fundamental reason why Flink should be much faster than Kafka Streams for such a use case. One exception is repartitioning, which always needs to go through the Kafka cluster with Kafka Streams but can stay within the Flink cluster with Flink; as I understand it, though, you are not repartitioning in your use case.
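To make the repartitioning point concrete, here is a hedged sketch (the topic name and value type are made up): as long as the key is unchanged, Kafka Streams aggregates locally; once you re-key, it inserts an internal repartition topic and every record takes a round trip through the brokers, whereas Flink would shuffle directly between its own tasks.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class RepartitionSketch {
    // Hypothetical value type, only for illustration.
    record Trade(String symbol, double price) {}

    static void build(StreamsBuilder builder) {
        KStream<String, Trade> trades = builder.stream("trades");

        // Key unchanged: the aggregation runs locally on each instance,
        // no internal repartition topic is created.
        trades.groupByKey().count();

        // Key changed: Kafka Streams inserts an internal "-repartition"
        // topic, so every record makes a round trip through the Kafka
        // cluster before being aggregated. Flink would instead shuffle
        // this directly between its own task managers.
        trades.selectKey((key, trade) -> trade.symbol()).groupByKey().count();
    }
}
```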

Serialization format may play a large role, however. Some benchmarks I remember for protobuf (Avro behaves similarly) showed that the in-memory (Java) representation can be around 100x larger than the serialized data on the wire. Again, this depends on many things, in particular how nested and complex your schema is. If Avro is deserialized into a complex object model, this causes significant CPU and memory overhead compared to passing strings around.
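If that turns out to be the culprit, one common mitigation is to project each deserialized record down to the primitives you actually need as early as possible, so the large object graph stays short-lived. A sketch, assuming a `GenericRecord` source and a field called "price" (both assumptions on my part):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class ProjectionSketch {
    static void build(StreamsBuilder builder) {
        // Deserialize to GenericRecord, then immediately project down to
        // the single primitive the aggregation needs, so the full object
        // graph can be garbage-collected right away instead of flowing
        // through (and being copied by) every downstream operator.
        KStream<String, GenericRecord> raw = builder.stream("input-topic");
        KStream<String, Double> prices =
            raw.mapValues(record -> (Double) record.get("price")); // "price": assumed field name
        // ... windowed aggregation on `prices` as before ...
    }
}
```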

However, the only way to tell for certain what is slowing down your use case is to profile it and see where the additional resources are spent.
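Since this is a JVM application, Java Flight Recorder (bundled with the JDK) is a low-overhead way to do that; you can attach with `jcmd` or start it programmatically. A minimal sketch, with an arbitrary duration and file name, to be opened afterwards in JDK Mission Control:

```java
import java.nio.file.Path;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class ProfileSketch {
    public static void main(String[] args) throws Exception {
        // Use the bundled "profile" settings (CPU sampling + allocation events).
        Configuration config = Configuration.getConfiguration("profile");
        try (Recording recording = new Recording(config)) {
            recording.start();
            Thread.sleep(60_000); // let the streams app run under load
            recording.dump(Path.of("streams-profile.jfr"));
        }
    }
}
```

In the resulting recording, compare how much CPU and allocation shows up in Avro deserialization versus the windowed aggregation itself; that should directly answer your question.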