So I have some data as follows:
ID action_date data
001 2021-01-20 jams
002 2021-01-23 orange
003 2021-01-19 banana
001 2021-04-11 pineap
002 2021-03-01 grape
004 2021-03-01 apple
Using this data, I would like to make a table of cumulative sums of the entries in the above table.
What I mean by this is for every month, I would like to determine each user's history. If no data exists for the user in that month, the previous month's data is used.
I would like to have the unique id effectively be the ID of a user and the month.
The following variables be created:
action_date
appears in the month.data
value. exclusive of the current month.Final output looks as follows:
ID month_end action_taken_this_month total_actions most_recent_data
001 2021-01-31 1 0 NA
001 2021-02-28 0 1 jams
001 2021-03-31 0 1 jams
001 2021-04-30 1 1 jams
002 2021-01-31 1 0 NA
002 2021-02-28 0 1 orange
002 2021-03-31 1 1 orange
002 2021-04-30 0 2 grape
003 2021-01-31 1 1 NA
003 2021-02-28 0 1 banana
003 2021-03-31 0 1 banana
003 2021-04-30 0 1 banana
004 2021-03-31 1 1 NA
004 2021-04-30 0 1 apple
This is a request that's overly complicated by arbitrary rules, but I'm going to give you an answer that's 90% there.
It uses cross joins to generate the combinations with no data, and a UDF to generate the missing months. Then window functions to get the desired values:
with data as (
select x[4]::string id, x[5]::date action_date, x[6]::string data
from (
select split(value, ' ') x
from table(split_to_table(
$$ 001 2021-01-20 jams
002 2021-01-23 orange
003 2021-01-19 banana
001 2021-04-11 pineap
002 2021-03-01 grape
004 2021-03-01 apple$$, '\n'
)))), range_months as (
select date_trunc(month, min(action_date)) since, max(action_date) until
from data
), all_months as (
select value::date m
from range_months, table(flatten(list_months_between(since, until)))
), all_ids as (
select distinct id
from data
), all_crossed as (
select *
from all_months, all_ids
), left_joined as (
select m, a.id, data, b.action_date action_date
from all_crossed a
left join data b
on a.id=b.id
and date_trunc(month, b.action_date)=a.m
), almost_there as (
select m, id, count(action_date) action_taken_this_month, any_value(data) recent_data
, lag(action_taken_this_month) over(partition by id order by m) previous_action_taken_this_month
from left_joined
group by 1, 2
)
select id, last_day(m) month_end
, action_taken_this_month
, ifnull(sum(previous_action_taken_this_month) over(partition by id order by m), 0) total_actions
, lag(recent_data, 1) over(partition by id order by m) most_recent_data
from almost_there
order by 1, month_end
I wrote the UDF to generate a list of months in Python, but you can rewrite in your favorite language if desired:
create or replace function list_months_between(since date, until date)
returns array
language python
runtime_version='3.8'
packages=('pandas')
handler = 'x'
as
$$
import pandas as pd
def x(since, until):
return pd.date_range(since, until, freq='MS').strftime("%Y-%m-%d").tolist()
$$
;
select list_months_between('2020-01-01', '2020-03-10');
At the end of this series of sub-queries, we get results 90% like desired:
After all this work, I hope you can take this queries and add an extra join to replace some of the nulls on most_recent_data
with the last lookup value.