sql snowflake-cloud-data-platform window-functions cumulative-sum

Cumulative sum for months that do and dont exist in a Snowflake table

So I have some data as follows:

ID   action_date     data
001   2021-01-20     jams
002   2021-01-23   orange
003   2021-01-19   banana
001   2021-04-11   pineap
002   2021-03-01    grape
004   2021-03-01    apple

Using this data, I would like to make a table of cumulative sums of the entries in the above table.

What I mean by this is for every month, I would like to determine each user's history. If no data exists for the user in that month, the previous month's data is used.

I would like to have the unique id effectively be the ID of a user and the month.

The following variables be created:

action_taken_this_month - Does the ID appear this month in the original table? Identified if the action_date appears in the month.
total_actions - How many times did the ID appear in the table, exclusive of the current month.
most_recent_data - What was the most recent data value. exclusive of the current month.

Final output looks as follows:

ID    month_end   action_taken_this_month     total_actions most_recent_data
001  2021-01-31                         1                 0               NA
001  2021-02-28                         0                 1             jams
001  2021-03-31                         0                 1             jams
001  2021-04-30                         1                 1             jams
002  2021-01-31                         1                 0               NA
002  2021-02-28                         0                 1           orange
002  2021-03-31                         1                 1           orange
002  2021-04-30                         0                 2            grape
003  2021-01-31                         1                 1               NA
003  2021-02-28                         0                 1           banana
003  2021-03-31                         0                 1           banana
003  2021-04-30                         0                 1           banana
004  2021-03-31                         1                 1               NA
004  2021-04-30                         0                 1            apple

Solution

This is a request that's overly complicated by arbitrary rules, but I'm going to give you an answer that's 90% there.

It uses cross joins to generate the combinations with no data, and a UDF to generate the missing months. Then window functions to get the desired values:

with data as (
    select x[4]::string id, x[5]::date action_date, x[6]::string data
    from (
    select split(value, ' ') x
    from table(split_to_table(
$$    001 2021-01-20 jams
    002 2021-01-23 orange
    003 2021-01-19 banana
    001 2021-04-11 pineap
    002 2021-03-01 grape
    004 2021-03-01 apple$$, '\n'
)))), range_months as (
    select date_trunc(month, min(action_date)) since, max(action_date) until
    from data
), all_months as (
    select value::date m
    from range_months, table(flatten(list_months_between(since, until)))
), all_ids as (
    select distinct id
    from data
), all_crossed as (
    select *
    from all_months, all_ids
), left_joined as (
    select m, a.id, data, b.action_date action_date
    from all_crossed a
    left join data b
    on a.id=b.id
    and date_trunc(month, b.action_date)=a.m
), almost_there as (
    select m, id, count(action_date) action_taken_this_month, any_value(data) recent_data
        , lag(action_taken_this_month) over(partition by id order by m) previous_action_taken_this_month
    from left_joined
    group by 1, 2
)

select id, last_day(m) month_end
    , action_taken_this_month
    , ifnull(sum(previous_action_taken_this_month) over(partition by id order by m), 0) total_actions
    , lag(recent_data, 1) over(partition by id order by m) most_recent_data
from almost_there
order by 1, month_end

I wrote the UDF to generate a list of months in Python, but you can rewrite in your favorite language if desired:

create or replace function list_months_between(since date, until date)
returns array
language python
runtime_version='3.8'
packages=('pandas')
handler = 'x'
as 
$$
import pandas as pd
def x(since, until):
    return pd.date_range(since, until, freq='MS').strftime("%Y-%m-%d").tolist()
$$
;

select list_months_between('2020-01-01', '2020-03-10');

At the end of this series of sub-queries, we get results 90% like desired:

After all this work, I hope you can take this queries and add an extra join to replace some of the nulls on most_recent_data with the last lookup value.