I'm processing a stream of events coming from IoT devices.
These events carry a first-level timestamp, set by the network. Each event also packs together several measures taken at different points in time. For instance:
The measures are to be aggregated hourly; in this case M1 should go into the 8:00-9:00 window, and M2 into the 9:00-10:00 window.
I'm wondering what the proper way is to design my Flink app, manage those timestamps, and handle the related watermarks. From my understanding so far:
assignTimestampsAndWatermarks(), correct? What strategy should I use, given the 15-minute spread between measures that arrive together?
PS: nope, I can't change the IoT device.
PPS: I plan to use EMR, so Flink 1.11, in case that has any impact on the design.
Typically, with an out-of-order event stream, you want to use the bounded-out-of-orderness watermark strategy with a duration large enough to cover the expected out-of-orderness. So at least 15 minutes, in this case.
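To make the watermark arithmetic concrete, here is a minimal plain-Java sketch of what a bounded-out-of-orderness generator computes. It does not use Flink; the class name BoundedWatermarkSketch and its methods are hypothetical, for illustration only. In an actual Flink 1.11 job you would normally rely on WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofMinutes(15)) combined with a timestamp assigner that reads the measure's own timestamp.

```java
import java.time.Duration;

// Hypothetical illustration class, not part of Flink.
class BoundedWatermarkSketch {
    private final long boundMillis;
    private long maxTimestamp;

    BoundedWatermarkSketch(Duration bound) {
        this.boundMillis = bound.toMillis();
        // Chosen so the initial watermark is Long.MIN_VALUE without overflowing.
        this.maxTimestamp = Long.MIN_VALUE + boundMillis + 1;
    }

    // Called for every measure, with the measure's own event time in epoch millis.
    void onEvent(long eventTimestampMillis) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMillis);
    }

    // The watermark trails the largest timestamp seen so far by the bound, minus 1 ms.
    long currentWatermark() {
        return maxTimestamp - boundMillis - 1;
    }
}
```

With a 15-minute bound, once a measure stamped 9:15:00.000 has been seen, the watermark sits at 8:59:59.999, and a late measure stamped 8:50 that arrives afterwards does not move it backwards.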
If you are aggregating hourly windows, this should be pretty workable, assuming you can tolerate waiting until 15 minutes after the hour ends to see any results. If you can do incremental aggregation of the window results (via reduce or aggregate), that will be much more efficient.
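The idea behind incremental aggregation can be sketched in plain Java, again without Flink: each hourly window keeps only a small running accumulator instead of buffering every measure, and the window becomes eligible to fire once the watermark reaches its last millisecond. The class HourlyAverageSketch, the choice of an average as the aggregate, and all method names are assumptions made for illustration. In Flink the equivalent would be a tumbling event-time window of one hour with a ReduceFunction or AggregateFunction.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of Flink.
class HourlyAverageSketch {
    static final long HOUR_MS = 3_600_000L;

    // One accumulator per window start: [0] = running sum, [1] = count.
    private final Map<Long, double[]> accumulators = new HashMap<>();

    // Start of the hourly window that contains the given event time.
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, HOUR_MS);
    }

    // Fold one measure into its window's accumulator; nothing else is stored.
    void add(long timestampMillis, double value) {
        double[] acc = accumulators.computeIfAbsent(
                windowStart(timestampMillis), k -> new double[2]);
        acc[0] += value;
        acc[1] += 1;
    }

    // A window may fire once the watermark has reached its last millisecond.
    static boolean canFire(long windowStartMillis, long watermark) {
        return watermark >= windowStartMillis + HOUR_MS - 1;
    }

    // Average for a window, or NaN if no measure was recorded for it.
    double average(long windowStartMillis) {
        double[] acc = accumulators.get(windowStartMillis);
        return acc == null ? Double.NaN : acc[0] / acc[1];
    }
}
```

With a 15-minute out-of-orderness bound, the 8:00-9:00 window becomes eligible to fire when a measure stamped 9:15:00.000 or later has been seen, which is the 15-minute wait mentioned above.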