
Flink and non-keyed window state


I'm creating a Flink application that simply forwards windowed incoming Kafka events to another Kafka topic, adding start and end markers for each window. For example, for a 1-hour window containing 1, 2, 3, 4, 5, I will sink start_timestamp, 1, 2, 3, 4, 5, end_timestamp into a different Kafka topic. There may be some other transforms later on, but in general, for N events coming in, I'm always going to emit at least N+2 events.

As I understand it, using windowAll() with a ProcessAllWindowFunction that injects the start and end markers should do this. My question is about state management. I'll be using the RocksDB state backend - will it also keep persisting the internal window state even for this non-keyed stream? My main concern is keeping the accumulating window contents in state so that I don't have to reprocess them, especially for large windows.
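
Roughly, I'm planning something like this (a minimal sketch, assuming String events from Kafka with event-time timestamps and watermarks already assigned; the start_/end_ marker format is just illustrative):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// events: DataStream<String> read from Kafka, timestamps/watermarks already assigned
DataStream<String> marked = events
    .windowAll(TumblingEventTimeWindows.of(Time.hours(1)))
    .process(new ProcessAllWindowFunction<String, String, TimeWindow>() {
        @Override
        public void process(Context context, Iterable<String> elements, Collector<String> out) {
            // Emit a start marker, then all buffered elements, then an end marker.
            out.collect("start_" + context.window().getStart());
            for (String element : elements) {
                out.collect(element);
            }
            out.collect("end_" + context.window().getEnd());
        }
    });
```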


Solution

  • For something this simple, I'd use a FlatMap (with parallelism set to 1) that keeps in state the start of the current window and the last event time. Whenever a record arrives, if it falls into a new hourly window, I'd emit the end_timestamp (the last event time), the start_timestamp (taken from the new record), and update the current window start in state. In all cases, the last event time in state is also updated. This assumes your incoming events are strictly ordered, so you don't have to worry about late data.
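
A minimal sketch of that FlatMap, assuming a hypothetical MyEvent type with a getTimestamp() accessor and strictly ordered input. Keying by a constant is one way to get fault-tolerant keyed state (persisted in RocksDB) while still routing everything through a single instance:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class WindowMarkerFlatMap extends RichFlatMapFunction<MyEvent, String> {

    private static final long HOUR_MS = 60 * 60 * 1000L;

    private transient ValueState<Long> currentWindowStart; // start of the hour we're currently in
    private transient ValueState<Long> lastEventTime;      // timestamp of the previous event

    @Override
    public void open(Configuration parameters) {
        currentWindowStart = getRuntimeContext().getState(
            new ValueStateDescriptor<>("currentWindowStart", Long.class));
        lastEventTime = getRuntimeContext().getState(
            new ValueStateDescriptor<>("lastEventTime", Long.class));
    }

    @Override
    public void flatMap(MyEvent event, Collector<String> out) throws Exception {
        long eventHour = event.getTimestamp() - (event.getTimestamp() % HOUR_MS);
        Long windowStart = currentWindowStart.value();

        if (windowStart == null) {
            // Very first event: open the first window.
            out.collect("start_" + event.getTimestamp());
            currentWindowStart.update(eventHour);
        } else if (eventHour > windowStart) {
            // New hourly window: close the previous one, open the next.
            out.collect("end_" + lastEventTime.value());
            out.collect("start_" + event.getTimestamp());
            currentWindowStart.update(eventHour);
        }

        out.collect(event.toString());
        lastEventTime.update(event.getTimestamp());
    }
}
```

Wiring it up could look like this (the constant key sends all records to the same state, and parallelism 1 keeps ordering):

```java
events
    .keyBy(e -> 0)                        // single constant key -> one state instance
    .flatMap(new WindowMarkerFlatMap())
    .setParallelism(1)
    .addSink(kafkaSink);                  // kafkaSink: your Kafka producer sink
```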