Search code examples
sqlsnowflake-cloud-data-platformwindow-functionscumulative-sum

Cumulative sum for months that do and dont exist in a Snowflake table


So I have some data as follows:

ID   action_date     data
001   2021-01-20     jams
002   2021-01-23   orange
003   2021-01-19   banana
001   2021-04-11   pineap
002   2021-03-01    grape
004   2021-03-01    apple

Using this data, I would like to make a table of cumulative sums of the entries in the above table.

What I mean by this is for every month, I would like to determine each user's history. If no data exists for the user in that month, the previous month's data is used.

I would like to have the unique id effectively be the ID of a user and the month.

The following variables be created:

  1. action_taken_this_month - Does the ID appear this month in the original table? Identified if the action_date appears in the month.
  2. total_actions - How many times did the ID appear in the table, exclusive of the current month.
  3. most_recent_data - What was the most recent data value. exclusive of the current month.

Final output looks as follows:

ID    month_end   action_taken_this_month     total_actions most_recent_data
001  2021-01-31                         1                 0               NA
001  2021-02-28                         0                 1             jams
001  2021-03-31                         0                 1             jams
001  2021-04-30                         1                 1             jams
002  2021-01-31                         1                 0               NA
002  2021-02-28                         0                 1           orange
002  2021-03-31                         1                 1           orange
002  2021-04-30                         0                 2            grape
003  2021-01-31                         1                 1               NA
003  2021-02-28                         0                 1           banana
003  2021-03-31                         0                 1           banana
003  2021-04-30                         0                 1           banana
004  2021-03-31                         1                 1               NA
004  2021-04-30                         0                 1            apple

Solution

  • This is a request that's overly complicated by arbitrary rules, but I'm going to give you an answer that's 90% there.

    It uses cross joins to generate the combinations with no data, and a UDF to generate the missing months. Then window functions to get the desired values:

    with data as (
        select x[4]::string id, x[5]::date action_date, x[6]::string data
        from (
        select split(value, ' ') x
        from table(split_to_table(
    $$    001 2021-01-20 jams
        002 2021-01-23 orange
        003 2021-01-19 banana
        001 2021-04-11 pineap
        002 2021-03-01 grape
        004 2021-03-01 apple$$, '\n'
    )))), range_months as (
        select date_trunc(month, min(action_date)) since, max(action_date) until
        from data
    ), all_months as (
        select value::date m
        from range_months, table(flatten(list_months_between(since, until)))
    ), all_ids as (
        select distinct id
        from data
    ), all_crossed as (
        select *
        from all_months, all_ids
    ), left_joined as (
        select m, a.id, data, b.action_date action_date
        from all_crossed a
        left join data b
        on a.id=b.id
        and date_trunc(month, b.action_date)=a.m
    ), almost_there as (
        select m, id, count(action_date) action_taken_this_month, any_value(data) recent_data
            , lag(action_taken_this_month) over(partition by id order by m) previous_action_taken_this_month
        from left_joined
        group by 1, 2
    )
    
    select id, last_day(m) month_end
        , action_taken_this_month
        , ifnull(sum(previous_action_taken_this_month) over(partition by id order by m), 0) total_actions
        , lag(recent_data, 1) over(partition by id order by m) most_recent_data
    from almost_there
    order by 1, month_end
    

    I wrote the UDF to generate a list of months in Python, but you can rewrite in your favorite language if desired:

    create or replace function list_months_between(since date, until date)
    returns array
    language python
    runtime_version='3.8'
    packages=('pandas')
    handler = 'x'
    as 
    $$
    import pandas as pd
    def x(since, until):
        return pd.date_range(since, until, freq='MS').strftime("%Y-%m-%d").tolist()
    $$
    ;
    
    select list_months_between('2020-01-01', '2020-03-10');
    

    At the end of this series of sub-queries, we get results 90% like desired:

    enter image description here

    After all this work, I hope you can take this queries and add an extra join to replace some of the nulls on most_recent_data with the last lookup value.