Tags: event-handling, apache-kafka, apache-flink, flink-streaming

How does Flink consume messages from a Kafka topic with multiple partitions, without getting skewed?


Say we have 3 Kafka partitions for one topic, and I want my events windowed by the hour, using event time.

Will the Kafka consumer stop reading from a partition when its events fall outside the current window? Or does it open a new window? If it opens new windows, wouldn't it be theoretically possible for it to open an unlimited number of windows and thus run out of memory, if one partition's event time were heavily skewed relative to the others? This scenario seems especially likely when replaying history.

I have been trying to find the answer in the documentation, but cannot find much about the internals of how Flink handles Kafka partitions. Some good documentation on this specific topic would be very welcome.

Thanks!


Solution

  • First of all, events are read from Kafka continuously; the downstream windowing operations have no impact on consumption. There are a few more things to consider regarding running out of memory:

    • usually you do not store every event for a window, but only an aggregate of the events
    • whenever a window is closed, the corresponding memory is freed.

    Some more on how the Kafka consumer interacts with event time (watermarks in particular) can be found in the Flink documentation.
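
    The two points above can be sketched in plain Python (this is a toy model of the behaviour, not the Flink API; the class name `WindowedCounter` and all details are illustrative). It models how the operator watermark is the minimum across all Kafka partitions, how windows keep only a running aggregate rather than raw events, and how a window's state is freed once the watermark passes its end:

    ```python
    from collections import defaultdict

    WINDOW_MS = 3_600_000  # 1-hour event-time windows, as in the question

    class WindowedCounter:
        """Toy model: per-window aggregates (not raw events) plus a
        watermark taken as the minimum across all Kafka partitions."""

        def __init__(self, num_partitions):
            self.partition_watermarks = [0] * num_partitions
            self.open_windows = defaultdict(int)   # window start -> running count
            self.closed_windows = {}               # emitted results

        def on_event(self, partition, event_time_ms):
            # Events are read continuously; windowing never pauses the consumer.
            window_start = event_time_ms // WINDOW_MS * WINDOW_MS
            self.open_windows[window_start] += 1   # aggregate: O(1) state per window
            self.partition_watermarks[partition] = max(
                self.partition_watermarks[partition], event_time_ms)
            self._advance_watermark()

        def _advance_watermark(self):
            # The operator watermark is the minimum over all partitions,
            # so a single lagging partition holds every window open.
            watermark = min(self.partition_watermarks)
            for start in [s for s in self.open_windows
                          if s + WINDOW_MS <= watermark]:
                # Window closes: result emitted, its memory freed.
                self.closed_windows[start] = self.open_windows.pop(start)

    c = WindowedCounter(num_partitions=3)
    c.on_event(0, 10)
    c.on_event(1, 20)
    c.on_event(2, 2 * WINDOW_MS)      # partition 2 is far ahead (skew)
    # The first window stays open: partitions 0 and 1 hold the watermark back.
    c.on_event(0, WINDOW_MS + 1)
    c.on_event(1, WINDOW_MS + 1)
    # Now min(partition watermarks) has passed hour 1, so window [0, 1h) closes.
    ```

    Note that even under heavy skew the state per open window is a single aggregate value, which is why memory stays bounded in practice; the cost of skew is latency (windows close late), not unbounded raw-event buffering.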