Tags: apache-flink, flink-streaming, flink-sql

Flink Windows - how to emit intermediate results as soon as new event comes in?


Flink 1.14, Java, Table API + DataStream API (toDataStream/toAppendStream). I'm trying to read events from Kafka, aggregate them hourly (sum, count, etc.), and upsert the results to Cassandra as soon as new events arrive; in other words, create new records or re-calculate already existing records on every new event and sink the results to Cassandra immediately. The aim is to see continuously updating sum and count values for the primary-keyed records. For this purpose I'm using SQL:

...
TUMBLE(TABLE mytable, DESCRIPTOR(action_datetime), INTERVAL '1' HOURS)
...
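
For reference, a fuller version of that statement might look like the sketch below. The key and measure columns (user_id, amount) are assumptions for illustration, not details from the actual job:

// Sketch only: tableEnv is a StreamTableEnvironment, and the Kafka-backed table
// "mytable" is assumed to have a watermark defined on action_datetime.
tableEnv.executeSql(
    "SELECT window_start, window_end, user_id, " +
    "       SUM(amount) AS total_amount, COUNT(*) AS event_count " +
    "FROM TABLE(" +
    "  TUMBLE(TABLE mytable, DESCRIPTOR(action_datetime), INTERVAL '1' HOURS)) " +
    "GROUP BY window_start, window_end, user_id");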

But the task only sends results to Cassandra after the window interval expires (every 1 hour). I know it works as described in the docs:

Unlike other aggregations on continuous tables, window aggregation do not emit intermediate results but only a final result, the total aggregation at the end of the window.

Question: how can I achieve that behavior (emit intermediate results to the sink as soon as a new event comes in) instead of waiting 1 hour for the window to close?


Solution

  • Here are some options. Perhaps none of them is precisely what you want:

    (1) Use CUMULATE instead of TUMBLE. This won't give you updated results with every new event, but you can have the result updated frequently -- e.g., once a minute. See the first sketch after this list.

    (2) Use OVER aggregation. This will give you a continuously updated aggregation over the previous 60 minutes (aligned to each event, rather than to the epoch). See the second sketch below.

    (3) Use DataStream windows with a custom Trigger that fires with each event. This will provide the behavior you've asked for, but will require a rewrite using the DataStream API. See the Trigger sketch below.
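
A minimal sketch of option (1), reusing the same hypothetical user_id and amount columns as above. CUMULATE keeps the hour-long window but emits a refreshed partial result at every one-minute step:

// Sketch only: emits updated sum/count for the current hour once per minute.
tableEnv.executeSql(
    "SELECT window_start, window_end, user_id, " +
    "       SUM(amount) AS total_amount, COUNT(*) AS event_count " +
    "FROM TABLE(" +
    "  CUMULATE(TABLE mytable, DESCRIPTOR(action_datetime), " +
    "           INTERVAL '1' MINUTES, INTERVAL '1' HOURS)) " +
    "GROUP BY window_start, window_end, user_id");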
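
A sketch of option (2): an OVER aggregation produces one result row per incoming event, covering the hour that precedes that event (again with assumed column names):

// Sketch only: every incoming event yields a row aggregating the previous
// 60 minutes for its key, so the sink receives continuously updated values.
tableEnv.executeSql(
    "SELECT user_id, action_datetime, " +
    "       SUM(amount) OVER w AS total_amount, COUNT(amount) OVER w AS event_count " +
    "FROM mytable " +
    "WINDOW w AS (" +
    "  PARTITION BY user_id ORDER BY action_datetime " +
    "  RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW)");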
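
And a sketch of option (3) using the DataStream API: a Trigger that fires on every element makes the tumbling window re-emit its current aggregate immediately. The Event type, its userId field, SumAndCountAggregate, and cassandraSink below are hypothetical placeholders, not classes from the original job:

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Fires on every element so the window emits an updated aggregate per event,
// then fires-and-purges once the window's end timestamp is reached.
public class PerEventTrigger<T> extends Trigger<T, TimeWindow> {

    @Override
    public TriggerResult onElement(T element, long timestamp, TimeWindow window, TriggerContext ctx) {
        // Keep the end-of-window timer so the final firing can purge window state.
        ctx.registerEventTimeTimer(window.maxTimestamp());
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp()
                ? TriggerResult.FIRE_AND_PURGE
                : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}

// Hypothetical usage on a keyed stream of events:
events
    .keyBy(e -> e.userId)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .trigger(new PerEventTrigger<Event>())
    .aggregate(new SumAndCountAggregate())
    .addSink(cassandraSink);

Because every firing writes the same primary key (the key plus the window start), the Cassandra sink effectively upserts the same row with fresher sum and count values.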