Tags: date, apache-spark-sql, sum, date-range

Spark SQL - How to SUM values over a date range linked to a specific date


I need to sum the values over the last 30 days (exclusive) relative to each date, for every product in every store. Assume all months have 30 days:

date|store|product|values
2020-06-30|Store1|Product1|1
2020-07-02|Store1|Product2|4
2020-07-01|Store2|Product1|3
2020-07-18|Store1|Product1|4
2020-07-18|Store1|Product2|2
2020-07-18|Store2|Product1|2
2020-07-30|Store1|Product1|1
2020-08-01|Store1|Product1|1
2020-08-01|Store1|Product2|1
2020-08-01|Store2|Product1|6

For the rows of day 2020-08-01, sum the values from (2020-08-01 - 30 days) through 2020-07-31 and put the result in the 2020-08-01 row, like this (the first line doesn't include '2020-06-30' because it is more than 30 days ago, nor '2020-08-01' because it is the same day, and so on):

date|store|product|sum_values_over_last_30_days_to_this_date
2020-08-01|Store1|Product1|5
2020-08-01|Store1|Product2|6
2020-08-01|Store2|Product1|5
....
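
For reference, a minimal PySpark sketch (untested) to register this sample data as a temp view named table, the name the queries below assume:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    ("2020-06-30", "Store1", "Product1", 1),
    ("2020-07-02", "Store1", "Product2", 4),
    ("2020-07-01", "Store2", "Product1", 3),
    ("2020-07-18", "Store1", "Product1", 4),
    ("2020-07-18", "Store1", "Product2", 2),
    ("2020-07-18", "Store2", "Product1", 2),
    ("2020-07-30", "Store1", "Product1", 1),
    ("2020-08-01", "Store1", "Product1", 1),
    ("2020-08-01", "Store1", "Product2", 1),
    ("2020-08-01", "Store2", "Product1", 6),
]

df = spark.createDataFrame(rows, ["date", "store", "product", "values"])
df = df.withColumn("date", df["date"].cast("date"))  # parse the strings into DateType
df.createOrReplaceTempView("table")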

I tried the query below, but it didn't work either:

spark.sql("""
SELECT 
a.date,
a.store,
a.product,
SUM(a.values) OVER (PARTITION BY a.product,a.store ORDER BY a.date BETWEEN a.date - INTERVAL '1' DAY AND a.date - INTERVAL '30' DAY) AS sum
FROM table a
""").show()

Can anybody help me?


Solution

  • You can try a self-join rather than a window function; maybe this kind of join will work:

    SELECT 
        a.date, 
        a.store,
        a.product,
        SUM(IFNULL(b.values, 0)) AS sum_values_over_last_30_days_to_this_date
    FROM
        table a
    LEFT JOIN
        (
            SELECT
                date, 
                store,
                product,
                values
            FROM
                table
        ) b
    ON
        a.store = b.store
    AND
        a.product = b.product
    AND
        b.date >= a.date - INTERVAL 30 DAYS
    AND
        b.date < a.date
    GROUP BY 
        1, 2, 3
    

    Make sure to sum the values from the inner query (b.values), so the result is the total up to, but not including, each date.
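
    A window function can also express this if you order by a day number instead of the raw date, so a numeric RANGE frame applies. A minimal, untested sketch (column and table names taken from the question; the DATEDIFF anchor date is arbitrary):

    spark.sql("""
    SELECT
        date,
        store,
        product,
        COALESCE(SUM(values) OVER (
            PARTITION BY store, product
            ORDER BY DATEDIFF(date, '1970-01-01')      -- days since an arbitrary anchor
            RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING -- the 30 days before, excluding the row's own date
        ), 0) AS sum_values_over_last_30_days_to_this_date
    FROM table
    """).show()

    The COALESCE mirrors the IFNULL in the join version: SUM over an empty frame returns NULL, which this turns into 0.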