Tags: apache-spark, spark-streaming, spark-structured-streaming

Is there a way to use Spark Structured Streaming to calculate daily aggregates?


I am planning to use structured streaming to calculate daily aggregates across different metrics.

Data volume < 1000 records per day.

Here is a simple example of the input data:

timestamp, Amount
1/1/20 10:00, 100
1/1/20 11:00, 200
1/1/20 23:00, 400
1/2/20 10:00, 100
1/2/20 11:00, 200
1/2/20 23:00, 400
1/2/20 23:10, 400

Expected output:

Day, Amount
1/1/20, 700
1/2/20, 1100

I am planning to do something like this in Structured Streaming, but I'm not sure whether it works or whether it's the right way to do it:

parsedDF \
    .withWatermark("date", "25 hours") \
    .groupBy(window("date", "24 hours")) \
    .sum("amount")


Solution

  • There is material overhead from running structured streams. Given that you're writing code to produce a single result every 24 hours, it would be a better use of resources to do the following, if you can accept a couple of extra minutes of latency in exchange for using far fewer resources (a sketch follows the list):

    • Ingest data into a table, partitioned by day
    • Write a simple SQL query against this table to generate your daily aggregate(s)
    • Schedule the job to run [watermark] seconds after midnight.
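
    Here is a minimal sketch of that batch flow in PySpark; the table name, paths, source format, and timestamp pattern are illustrative assumptions, not taken from your question:

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, to_date, to_timestamp

        spark = SparkSession.builder.appName("daily-batch").getOrCreate()

        # Ingest raw records and derive a day column to partition by.
        # Source path, format, and timestamp pattern are assumptions.
        events = (spark.read
            .option("header", True)
            .csv("/data/in")
            .withColumn("ts", to_timestamp(col("timestamp"), "M/d/yy H:mm"))
            .withColumn("day", to_date(col("ts"))))

        events.write.mode("append").partitionBy("day").saveAsTable("events")

        # The daily aggregate, scheduled to run [watermark] seconds after
        # midnight so it picks up all of yesterday's data, late arrivals included.
        spark.sql("""
            SELECT day, SUM(amount) AS amount
            FROM events
            WHERE day = date_sub(current_date(), 1)
            GROUP BY day
        """).show()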

    That advice assumes the default output mode, since you didn't specify one. If you want to stick with streaming, more context on your code and your goal would be helpful. For example:

    • How often do you want results, and do you need partial results before the end of the day?
    • How long do you want to wait for late data to update aggregates?
    • What output mode are you planning to use?
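
    If you do stay with streaming, here is a minimal end-to-end sketch of one option in append mode; the source format, paths, schema, and timestamp pattern are assumptions for illustration:

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, to_timestamp, window
        from pyspark.sql.functions import sum as sum_

        spark = SparkSession.builder.appName("daily-stream").getOrCreate()

        # Streaming source; format, path, and schema are assumptions.
        raw = (spark.readStream
            .format("csv")
            .option("header", True)
            .schema("timestamp STRING, amount DOUBLE")
            .load("/data/in"))

        parsed = raw.withColumn("date", to_timestamp(col("timestamp"), "M/d/yy H:mm"))

        # One result row per 24-hour window; the 25-hour watermark tolerates
        # roughly a day of late data before a window is finalized.
        daily = (parsed
            .withWatermark("date", "25 hours")
            .groupBy(window(col("date"), "24 hours"))
            .agg(sum_("amount").alias("amount")))

        query = (daily.writeStream
            .outputMode("append")
            .format("parquet")
            .option("path", "/data/out")
            .option("checkpointLocation", "/data/chk")
            .start())

    In append mode each day's total is emitted exactly once, after the watermark passes the end of that day's window. Switching to "update" mode would give you partial results during the day, but the file sink only supports append, so you would need a different sink (for example, foreachBatch writing to a table).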