Search code examples
apache-flinkflink-streaming

Understanding Watermarks


I am just trying to assert my understanding of how BoundedOutOfOrder Watermarks works in FLINK and in general in any stream processing framework.

Order of Event Processing :

11:00   11:01      11:02      11:03     11:04    11:05      11:06   11:07   11:08      <=== Processing Time
|   
|
|--------1---------2----------3---------4--------5---------------------------------      <== INPUT EVENTS
|
|                                                      (Watermark)
|--------1------------2----------3-----------------5------W!-----------------4      <==== Event Time
        11:01       11:02      11:03             11:05  11:05              11:04

I have events with values (1 to 5) produced with an Event Timestamp of 11:01 to 11:05. Producer will send an event every second. Assuming the window size of (5 seconds and current window -- 11:00 to 11:05), With a basic bounded out-of-order watermarks using the following code (in FLINK),

WatermarkStrategy
    .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, timestamp) -> event.f0);

Since, Watermark is just an event with a special timestamp indicating the end of a Window, Am I correct in understanding that FLINK will ingest a Watermark event in the pipeline after every 5 Seconds ?

Second question :

Considering the order of event processing outlined above, since "event-4" has arrived after the window was closed by the watermark, it will be considered as the "late data" and unless it's accounted for by the "allowdLateness" it will effectively be dropped.


Solution

  • Q1: "FLINK will ingest a Watermark event in the pipeline after every 5 Seconds"?

    No. Flink generates periodic watermarks, with the interval set to 200ms by default. You can use env.getConfig().setAutoWatermarkInterval() to change this. When your periodic watermark is generated, it's based on the max event time minus the bounded out-of-orderness value.

    Q2: "it will be considered as the "late data"".

    No. An event is only late if its timestamp is <= the current watermark. So if event timestamps are monotonically increasing (no out-of-order events, based on timestamps) then you'll never have late data.