Search code examples
sqlwhere-clausesnowflake-cloud-data-platformwindow-functions

Filter by partition in Snowflake


I want to filter out records where there is a record with a start_time before than start_time of created status for each id.

e.g. below id A has 'failed' status before 'created' based on start_time. So it needs to be filtered. Whereas id B has first 'created', then another status which is acceptable.

enter image description here

So the expected result is only this, but I'm looking for a scalable solution that works for thousands of rows.

enter image description here

WITH t1 AS (
SELECT 'A' AS id, 'failed' AS status, '2021-05-18 18:30:00'::timestamp AS start_time UNION ALL
SELECT 'A' AS id, 'created' AS status, '2021-05-24 11:30:00'::timestamp AS start_time UNION ALL
SELECT 'A' AS id, 'created' AS status, '2021-05-24 12:00:00'::timestamp AS start_time UNION ALL
SELECT 'B' AS id, 'created' AS status, '2021-05-19 18:30:00'::timestamp AS start_time UNION ALL
SELECT 'B' AS id, 'successful' AS status, '2021-05-20 11:30:00'::timestamp AS start_time
    )
SELECT *
FROM t1

Solution

  • There are multiple ways to achieve this but here is an approach using first_value.

    with t1 (id, status, start_time) as 
    (select 'a', 'failed', '2021-05-18 18:30:00'::timestamp union all
     select 'a', 'created', '2021-05-24 11:30:00'::timestamp union all
     select 'a', 'created', '2021-05-24 12:00:00'::timestamp union all
     select 'b', 'created', '2021-05-19 18:30:00'::timestamp union all
     select 'b', 'successful', '2021-05-20 11:30:00'::timestamp)
    
    select *
    from t1
    qualify first_value(status) over (partition by id order by start_time asc) = 'created'
    

    All you're doing is making sure that the first status for any given id is "created". Think of qualify clause as having clause for window functions. You can also break this down into a subquery if you find that more readable.

    Note: Solution above will also keep records that have only "created" status. If you want to make sure there are least two different statuses per id, modify it to

    select *
    from t1
    qualify first_value(status) over (partition by id order by start_time asc) = 'created'
            and 
            count(distinct status) over (partition by id) > 1;