I have snapshot data on a monthly basis (the day this runs changes each month but it runs sometimes once per month, sometimes twice). I have a unique col ID which I want to check if it is still present in the next months snapshot or if it had been dropped or added. When there is 2 dates present in the month I want to compare to the previous month not the most recent run for example
ID snapshot
1 01/01/2022
2 01/01/2022
3 01/01/2022
1 03/02/2022
2 03/02/2022
4 03/02/2022
1 15/02/2022
3 15/02/2022
4 15/02/2022
`
Output should look like
snapshot same dropped added total_current
01/01/2022 0 0 3 3
03/02/2022 2 1 1 3
15/02/2022 2 1 1 3
Query I have, I know its not 100% right as it rolls all the same month into one date, but its where I've got to so far and my brain can't think how to separate it out. I'm thinking I maybe need to extract month and year from the dates, and then bring in the current snapshot date
SELECT
coalesce(last_day(c.snapshot), add_months(last_day(p.snapshot), 1) )
,SUM(CASE WHEN c.ID is not null and p.ID is not null THEN 1 ELSE 0 END) AS same
,SUM(CASE WHEN c.ID is null and p.ID is not null THEN 1 ELSE 0 END) AS dropped
,SUM(CASE WHEN c.ID is not null and p.ID is null THEN 1 ELSE 0 END) AS added
,count(c.snapshot) as total_current
FROM table c
full outer join table p
ON c.ID = p.ID
AND LAST_DAY(c.snapshot) = add_months(last_day(p.snapshot), 1))
GROUP BY 1;
The approach of self outer join is what I'd do too. The issue is that the DROPPED
column needs to reference a different group by expression. It's actually not possible to group by that expression even with a chained CTE. The next best thing is to reference the previous row's TOTAL_CURRENT
value using a lag window function and do the math:
with RAW as
(
select $1 as ID, $2 as SNAPSHOT from values
(1, '01/01/2022'),
(2, '01/01/2022'),
(3, '01/01/2022'),
(1, '03/02/2022'),
(2, '03/02/2022'),
(4, '03/02/2022'),
(1, '15/02/2022'),
(3, '15/02/2022'),
(4, '15/02/2022')
), DATA as
(
select ID, to_date(SNAPSHOT, 'DD/MM/YYYY') SNAPSHOT from RAW
)
select CM.SNAPSHOT
,sum(iff(CM.ID = PM.ID,1,0)) SAME
,sum(iff(PM.ID is null,1,0)) ADDED
,count(*) TOTAL_CURRENT
,zeroifnull(lag(TOTAL_CURRENT) over (order by CM.SNAPSHOT) - TOTAL_CURRENT + ADDED) DROPPED
from DATA CM full outer join DATA PM
on CM.ID = PM.ID
and last_day(PM.SNAPSHOT) = add_months(last_day(CM.SNAPSHOT), -1)
group by CM.SNAPSHOT
having CM.SNAPSHOT is not null
order by CM.SNAPSHOT
Note that in order to simplify the logic, the DROPPED column is last. This enables reference to previously calculated columns. If it's important to have the columns in the specified order, you can simply add another table expression to the CTE that rearranges the order of the columns.
SNAPSHOT | SAME | ADDED | TOTAL_CURRENT | DROPPED |
---|---|---|---|---|
2022-01-01 00:00:00 | 0 | 3 | 3 | 0 |
2022-02-03 00:00:00 | 2 | 1 | 3 | 1 |
2022-02-15 00:00:00 | 2 | 1 | 3 | 1 |