Search code examples
apache-flinkflink-streaming

Understanding event emission timing in Flink Interval Joins with large windows


An interval join joins elements of two streams with a common key and where elements of stream B have timestamps that lie in a relative time interval to timestamps of elements in stream A.

I am confused how often joined events will be emitted. The join requires a lower bound and upper bound to be specified to define the window. If the lower bound and upper bound are large, like 7 days, does this mean that the window will only fire every 7 days? I understand there is a watermark driving this, but it's not as clear how it works in comparison to something like a tumbling window join.

As a follow up question, how can I make stream A's events and stream B's events live in Flink's state for 7 days, and then emit events immediately after 2 events are joined if they fall within the specified bounds?


Solution

  • An interval join will emit results immediately, as soon as inputs arrive that can be joined together. The watermarks are only used to purge events from the state store that can no longer affect the results. In other words, an interval join is the same as a regular join, except that the runtime doesn't have to keep every row in state forever. (The reason why Flink bothers with the notion of an interval join is because its state handling can be optimized.)