apache-spark spark-structured-streaming

Handling late events with Spark Structured Streaming watermark?


In my Structured Streaming job, I set the watermark to 1 hour.

I am doing a window operation every 10 minutes.

I received an event 20 minutes late.

Will the corresponding window be calculated or not?


Solution

  • A watermark lets late-arriving data be included in windowed results that have already been computed, for a bounded period of time. Spark tracks the maximum event time it has seen and subtracts the watermark delay; the result is the point in time before which no more late events are expected. Events older than that point may still arrive, but they are nonetheless discarded rather than added to their window. How and when the updated results are emitted depends on the output mode (for example, append or update).

    There are excellent examples, with diagrams, in the programming guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time

    Your question: Yes. The event is 20 minutes late, which is within the 1-hour watermark, so it is included and its 10-minute window is updated.
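
To make the arithmetic concrete, here is a minimal plain-Python sketch of the rule described above. It does not use Spark: the helper names (`window_for`, `is_accepted`) and the timestamps are hypothetical, and the model is simplified. It compares the event time directly against the watermark, whereas Spark advances the watermark between micro-batches and may treat boundary cases differently. The two constants correspond to what you would pass to `withWatermark` and `window` in the question's setup.

```python
from datetime import datetime, timedelta

# Assumed settings from the question: 1-hour watermark, 10-minute tumbling windows.
WATERMARK_DELAY = timedelta(hours=1)
WINDOW_SIZE = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1)


def window_for(event_time):
    """Return the (start, end) of the tumbling window that contains event_time."""
    offset = (event_time - EPOCH) % WINDOW_SIZE  # time elapsed since the window opened
    start = event_time - offset
    return start, start + WINDOW_SIZE


def is_accepted(event_time, max_event_time_seen):
    """Simplified rule: keep the event if it is newer than the watermark."""
    watermark = max_event_time_seen - WATERMARK_DELAY
    return event_time > watermark


# Hypothetical scenario: the newest event seen so far is stamped 12:34.
max_seen = datetime(2024, 1, 1, 12, 34)

late_by_20_min = datetime(2024, 1, 1, 12, 14)  # 20 minutes behind max_seen
late_by_74_min = datetime(2024, 1, 1, 11, 20)  # older than the 11:34 watermark

print(is_accepted(late_by_20_min, max_seen), window_for(late_by_20_min))
print(is_accepted(late_by_74_min, max_seen), window_for(late_by_74_min))
```

In this sketch the event that is 20 minutes late is accepted and lands in the 12:10 to 12:20 window, while the event stamped before the 11:34 watermark is rejected.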