Search code examples
apache-flinkflink-streamingwatermark

Proper way to manage watermark in flink with event scattered in time


I'm processing a stream of events coming from IOT devices.

These events have a first level of timestamp, set by the network. They're also packing together several measures taken at different points in time. For instance:

  • network time 9:08
  • measure M1 taken at 8:52
  • measure M2 taken at 9:07

The measures are to be aggregated hourly, in this case M1 should go in an 8:00-9:00 window, and M2 in a 9:00-10:00 window.

I wonder what is the proper way to design my flink app, manage those timestamps, and the related watermarks. From my understanding so far:

  • I should probably put all the processing related to network time (9:08) in a separate Flink app.
  • Have a Flink app processing the measures after they are unpacked (flap-mapped). Then assign the timestamp with assignTimestampsAndWatermarks(), correct ? What strategy should I use, given the 15mn spread there is between measures coming simultaneously ?

--

PS: nope, I can't change the IOT device

PPS: I plan to use EMR, so flink 1.11, if it has any impact on design.


Solution

  • Typically, with an out-of-order event stream, you want to use the bounded-of-order watermark strategy with a duration large enough to cover the expected out-of-orderness. So at least 15 minutes, in this case.

    If you are aggregating hourly windows, this should be pretty workable -- assuming you can tolerate waiting until 15 minutes after the hour ends to see any results. If you can do incremental aggregation of the window results (via reduce or aggregate) that will be much more efficient.