I need to do a sum of the values in the last 30 days (exclusive) relative to that date, with every product in every store. Assuming all months with 30 days:
date|store|product|values
2020-06-30|Store1|Product1|1
2020-07-02|Store1|Product2|4
2020-07-01|Store2|Product1|3
2020-07-18|Store1|Product1|4
2020-07-18|Store1|Product2|2
2020-07-18|Store2|Product1|2
2020-07-30|Store1|Product1|1
2020-08-01|Store1|Product1|1
2020-08-01|Store1|Product2|1
2020-08-01|Store2|Product1|6
In the lines of day 2020-08-01, sum the values of (2020-08-20 - 30 days) to 2020-08-19 and put it in the 2020-08-20 line, like this: (first line doesn't include '2020-06-30' because is more than 30 days ago and '2020-08-01' because is the same day, and this goes on...)
date|store|product|sum_values_over_last_30_days_to_this_date
2020-08-01|Store1|Product1|5
2020-08-01|Store1|Product2|6
2020-08-01|Store2|Product1|5
....
Tried this below and nothing too:
spark.sql("""
SELECT
a.date,
a.store,
a.product,
SUM(a.values) OVER (PARTITION BY a.product,a.store ORDER BY a.date BETWEEN a.date - INTERVAL '1' DAY AND a.date - INTERVAL '30' DAY) AS sum
FROM table a
""").show()
Anybody can help me?
You can try self-join
rather than window function, maybe this kind of join
will work -
SELECT
a.date,
a.store,
a.product,
SUM(IFNULL(b.value,0))
FROM
table a
LEFT JOIN
(
SELECT
a.date,
a.store,
a.product,
a.value
FROM
table a
)b
ON
a.store = b.store
AND
a.product = b.product
AND
a.date > b.date - INTERVAL 30 DAYS
AND a.date <= b.date
GROUP BY
1,2,3
Make sure to sum the value from the inner query, to have the sum up until this daye.