I have pasted a screenshot from Spark Structured Streaming documentation.
My question is: when I perform the actual count, won't I get an incorrect total? For example, let's say I perform a count at 12:10 to find out how many owls there are. Based on the diagram below, I will get 2 owls, when in reality only 1 owl has come in.
The above is a sliding window aggregation where you are aggregating based on event time and word columns.
In the world of streaming, aggregates (and other stateful operators, like joins) change over time. At precisely 12:10, consider what records you've received so far:
cat dog
dog dog
owl cat
So, if you do a .count() after splitting by space, you do, in fact, only have 1 owl. This is reflected in the middle green box, which shows 1 owl for both of the sliding windows.
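As a minimal sketch of that state, here is a plain-Python simulation (not Spark's actual execution) of splitting the three records received by 12:10 on spaces and counting each word:

```python
# The three records received by 12:10, per the diagram.
records_at_12_10 = ["cat dog", "dog dog", "owl cat"]

counts = {}
for line in records_at_12_10:
    for word in line.split(" "):
        counts[word] = counts.get(word, 0) + 1

print(counts)   # → {'cat': 2, 'dog': 3, 'owl': 1}
```

As of 12:10, the state holds exactly 1 owl, matching the middle green box.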
However, at 12:13, you receive owl. This means that any .count() after 12:13 will take this second owl into account. So, if we run a .count() at 12:15 (as in the diagram), 2 owls are around: the one at 12:07 and the one at 12:13. This leads to the following:
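To make the window assignment concrete, here is a plain-Python illustration (not Spark's API; the helper `sliding_windows` is hypothetical) of how 10-minute windows sliding every 5 minutes each claim the events whose timestamps fall inside them:

```python
from collections import Counter
from datetime import datetime

def sliding_windows(ts, length_min=10, slide_min=5):
    """Return every [start, end) window label that contains ts."""
    minute = ts.hour * 60 + ts.minute
    start = minute - (minute % slide_min)       # latest window start at/before ts
    starts = []
    while start > minute - length_min:          # walk back through overlapping windows
        starts.append(start)
        start -= slide_min
    fmt = lambda m: f"{m // 60:02d}:{m % 60:02d}"
    return [(fmt(s), fmt(s + length_min)) for s in starts]

# The two owl events from the diagram: one at 12:07, one at 12:13.
owl_events = [datetime(2024, 1, 1, 12, 7), datetime(2024, 1, 1, 12, 13)]

per_window = Counter()
for ts in owl_events:
    for window in sliding_windows(ts):
        per_window[window] += 1

print(per_window[("12:05", "12:15")])   # → 2  (both owls land in this window)
print(per_window[("12:10", "12:20")])   # → 1  (only the 12:13 owl)
```

Because the 12:05–12:15 window overlaps both event times, it is the one whose count rises to 2, while the neighboring windows each still hold a single owl.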