Search code examples
apache-flinkflink-streamingflink-statefun

Flink window aggregation with state


I would like to do a window aggregation with an early trigger logic (you can think that the aggregation is triggered either by window is closed, or by a specific event), and I read on the doc: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction

The doc mentioned that Note that using ProcessWindowFunction for simple aggregates such as count is quite inefficient. so the suggestion is to pair with incremental window aggregation.

My question is that AverageAggregate in the doc, the state is not saved anywhere, so if the application crashed, the averageAggregate will loose all the intermediate value, right?

So If that is the case, is there a way to do a window aggregation, still supports incremental aggregation, and has a state backend to recover from crash?


Solution

  • The AggregateFunction is indeed only describing the mechanism for combining the input events into some result, that specific class does not store any data.

    The state is persisted for us by Flink behind the scene though, when we write something like this:

    input
      .keyBy(<key selector>)
      .window(<window assigner>)
      .aggregate(new AverageAggregate(), new MyProcessWindowFunction());
    

    the .keyBy(<key selector>).window(<window assigner>) is indicating to Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.

    In case of crash or restart, no data is lost (assuming state backend are configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.