Search code examples
sqlsnowflake-cloud-data-platformdate-arithmeticcross-join

Best way to interpolate missing dates by group in SQL


I have a dataset like this:

Date Account Spend
2/1/21 A 4
3/1/21 A 6
5/1/21 A 7
6/1/21 A 2
4/1/21 B 8
5/1/21 B 2
6/1/21 B 1
9/1/21 B 7

Note that the dates the accounts span are NOT the same. I want to fill in zeroes for the missing months that are in between the minimum and maximum date for each account. The final table would look like this:

Date Account Spend
2/1/21 A 4
3/1/21 A 6
4/1/21 A 0
5/1/21 A 7
6/1/21 A 2
4/1/21 B 8
5/1/21 B 2
6/1/21 B 1
7/1/21 B 0
8/1/21 B 0
9/1/21 B 7

What's the best way to go about this in Snowflake SQL?

I thought I could cross join the accounts with another table that contains all months. Then I could join back to the original table and fill in any missing values within the Spend column with zeroes. But I'm unsure how to deal with the "lagging" and "leading" nulls that result from this. For example, there would be a null value for the combination of 2/1/21 and B after the cross join but that date occurs before the first occurrence of B in the original table (4/1/21) so I wouldn't want that row in my final dataset.


Solution

  • We indeed can solve this using a calendar table of month beginnings, say calendar(date).

    We can define the date range of each account in an aggregate subquery, then cross join it with the calendar table ; this gives us all possible date/account tuples. All that is left to do is to try and bring the corresponding row, if any, with a left join.

    select c.date, a.account, coalesce(t.spend, 0) spend
    from (
        select account, min(date) min_date, max(date) max_date 
        from mytable
        group by account
    ) a
    inner join calendar c on c.date >= a.min_date and c.date <= a.max_date
    left join mytable   t on t.date = c.date and t.account = a.account