I have a dataset like this:
Date | Account | Spend |
---|---|---|
2/1/21 | A | 4 |
3/1/21 | A | 6 |
5/1/21 | A | 7 |
6/1/21 | A | 2 |
4/1/21 | B | 8 |
5/1/21 | B | 2 |
6/1/21 | B | 1 |
9/1/21 | B | 7 |
Note that the dates the accounts span are NOT the same. I want to fill in zeroes for the missing months that are in between the minimum and maximum date for each account. The final table would look like this:
Date | Account | Spend |
---|---|---|
2/1/21 | A | 4 |
3/1/21 | A | 6 |
4/1/21 | A | 0 |
5/1/21 | A | 7 |
6/1/21 | A | 2 |
4/1/21 | B | 8 |
5/1/21 | B | 2 |
6/1/21 | B | 1 |
7/1/21 | B | 0 |
8/1/21 | B | 0 |
9/1/21 | B | 7 |
What's the best way to go about this in Snowflake SQL?
I thought I could cross join the accounts with another table that contains all months. Then I could join back to the original table and fill in any missing values within the Spend column with zeroes. But I'm unsure how to deal with the "lagging" and "leading" nulls that result from this. For example, there would be a null value for the combination of 2/1/21 and B after the cross join but that date occurs before the first occurrence of B in the original table (4/1/21) so I wouldn't want that row in my final dataset.
We indeed can solve this using a calendar table of month beginnings, say calendar(date)
.
We can define the date range of each account in an aggregate subquery, then cross join
it with the calendar table ; this gives us all possible date/account tuples. All that is left to do is to try and bring the corresponding row, if any, with a left join
.
select c.date, a.account, coalesce(t.spend, 0) spend
from (
select account, min(date) min_date, max(date) max_date
from mytable
group by account
) a
inner join calendar c on c.date >= a.min_date and c.date <= a.max_date
left join mytable t on t.date = c.date and t.account = a.account