Search code examples
sqlsnowflake-cloud-data-platformsnapshot

Checking what rows have been added, dropped, remained on monthly snapshots


I have snapshot data on a monthly basis (the day this runs changes each month but it runs sometimes once per month, sometimes twice). I have a unique col ID which I want to check if it is still present in the next months snapshot or if it had been dropped or added. When there is 2 dates present in the month I want to compare to the previous month not the most recent run for example

ID  snapshot
1   01/01/2022
2   01/01/2022
3   01/01/2022
1   03/02/2022
2   03/02/2022
4   03/02/2022
1   15/02/2022
3   15/02/2022
4   15/02/2022

`

Output should look like

snapshot   same   dropped  added  total_current
01/01/2022  0     0        3      3
03/02/2022  2     1        1      3
15/02/2022  2     1        1      3

Query I have, I know its not 100% right as it rolls all the same month into one date, but its where I've got to so far and my brain can't think how to separate it out. I'm thinking I maybe need to extract month and year from the dates, and then bring in the current snapshot date

SELECT
    coalesce(last_day(c.snapshot), add_months(last_day(p.snapshot), 1) )         
    ,SUM(CASE WHEN c.ID is not null and p.ID is not null THEN 1 ELSE 0 END) AS same
    ,SUM(CASE WHEN c.ID is null and p.ID is not null THEN 1 ELSE 0 END) AS dropped
    ,SUM(CASE WHEN c.ID is not null and p.ID is null THEN 1 ELSE 0 END) AS added
    ,count(c.snapshot) as total_current
FROM table c
full outer join table p
 ON c.ID = p.ID
 AND LAST_DAY(c.snapshot) = add_months(last_day(p.snapshot), 1))
GROUP BY 1;

Solution

  • The approach of self outer join is what I'd do too. The issue is that the DROPPED column needs to reference a different group by expression. It's actually not possible to group by that expression even with a chained CTE. The next best thing is to reference the previous row's TOTAL_CURRENT value using a lag window function and do the math:

    with RAW as
    (
    select $1 as ID, $2 as SNAPSHOT from values
    (1,   '01/01/2022'),
    (2,   '01/01/2022'),
    (3,   '01/01/2022'),
    (1,   '03/02/2022'),
    (2,   '03/02/2022'),
    (4,   '03/02/2022'),
    (1,   '15/02/2022'),
    (3,   '15/02/2022'),
    (4,   '15/02/2022')
    ), DATA as
    (
    select ID, to_date(SNAPSHOT, 'DD/MM/YYYY') SNAPSHOT from RAW
    )
    select   CM.SNAPSHOT
            ,sum(iff(CM.ID = PM.ID,1,0)) SAME
            ,sum(iff(PM.ID is null,1,0)) ADDED
            ,count(*) TOTAL_CURRENT
            ,zeroifnull(lag(TOTAL_CURRENT) over (order by CM.SNAPSHOT) - TOTAL_CURRENT + ADDED) DROPPED
    from DATA CM full outer join DATA PM
        on CM.ID = PM.ID
            and last_day(PM.SNAPSHOT) = add_months(last_day(CM.SNAPSHOT), -1)
    group by CM.SNAPSHOT
    having CM.SNAPSHOT is not null
    order by CM.SNAPSHOT
    

    Note that in order to simplify the logic, the DROPPED column is last. This enables reference to previously calculated columns. If it's important to have the columns in the specified order, you can simply add another table expression to the CTE that rearranges the order of the columns.

    SNAPSHOT SAME ADDED TOTAL_CURRENT DROPPED
    2022-01-01 00:00:00 0 3 3 0
    2022-02-03 00:00:00 2 1 3 1
    2022-02-15 00:00:00 2 1 3 1