Search code examples
apache-sparkaggregatespark-structured-streaming

Sliding Windowed Grouped Aggregation


I have pasted a screenshot from Spark Structured Streaming documentation.

My question is when I perform the actual count, am I not getting incorrect total count? For example, let's I perform a count at 12:10. I want to find out how many owls are there. Based on the diagram below I will get 2 owls, where in reality only 1 owl came in.

enter image description here

The above is a sliding window aggregation where you are aggregating based on event time and word columns.


Solution

  • In the world of streaming, aggregates (and other stateful operators, like joins) change over time. At precisely 12:10, consider what records you've received so far:

    • 12:02: cat dog
    • 12:03: dog dog
    • 12:07: owl cat

    So, if you do a .count() after splitting by space, you do, in fact, only have 1 owl. This is reflected in the middle green box, which shows 1 owl for both of the sliding windows.

    However, at 12:13, you receive owl. This means that any .count() after 12:13 will take into account this second owl. So, if we run a .count() at 12:15 pm (as in the diagram), 2 owls are around: the one at 12:07, and the one at 12:13. This leads to the following:

    • The 12:00 to 12:10 window has one owl (the 12:07 one)
    • The 12:05 to 12:15 window has two owls (the 12:07 and 12:13 one)
    • The 12:10 to 12:20 window has one owl (the 12:13 one)