I am planning to use structured streaming to calculate daily aggregates across different metrics.
Data volume < 1000 records per day.
Here is the simple example of input data
timestamp, Amount
1/1/20 10:00, 100
1/1/20 11:00, 200
1/1/20 23:00, 400
1/2/20 10:00, 100
1/2/20 11:00, 200
1/2/20 23:00, 400
1/2/20 23:10, 400
Expected output
Day, Amount
1/1/20, 700
1/2/20, 1100
I am planning to do something like this in the structured streaming not sure if it works or if it's the right way to do it?
parsedDF.withWatermark("date", "25 hours").groupBy("date", window("date", "24 hours")).sum("amount")
There is material overhead from running structured streams. Given you're writing code to produce a single result every 24 hours it would seem a better use of resources to do the following if you can take an extra couple minutes of latency in trade for using far fewer resources.
That's with the impression you're in the default output mode since you didn't specify one. If you want to stick with streaming, more context in your code and what your goal is would be helpful. For example, how often do you want results, and do you need partial results before the end of the day? How long do you want to wait for late data to update aggregates? What output mode are you planning to use?