Search code examples
scalaapache-sparkspark-structured-streaming

Handle Too Late data in Spark Streaming


Watermark allows late arriving data to be considered for inclusion against already computed results for a period of time using windows. Its premise is that it tracks to a point in time before which it is assumed no more late events are supposed to arrive, but if they do, they are none-the-less discarded.

Is there a way to store the discarded data, that can be used for reconciliation purpose later? Say In my Structured Streaming, I set the watermark to 1 hour. I am doing window operation for each 10 min and received a later event 20 min late. Is there a way I can store the discarded data say at a different location rather than discarding it?


Solution

  • No, there is no way to achieve this aspect.