Search code examples
sqlsnowflake-cloud-data-platformcommon-table-expressionrankrow-number

Snowflake row_number function is returning inconsistent results between CTE and view


I currently have a dataset, where it is possible for an person to have multiple records in one month, and I only want to take the first. To do this, I am using the row_number function, and my code is along the lines of the following:

select 
    person_id, date,
    year(date) as year, month(date) as month,
    row_number() over(partition by person_id, month order by date)
from 
    data
where 
    (person_id, date, extract(year from date), 
     extract(month from date)) not in (select person_id, min_date, 
                                              extract(year from min_date), 
                                              extract(month from min_date) from first_date)

For a person/month that only has two rows, the row numbers are 4&5 respectively. The Rank function returns the same result.

I tried reference the code above as a CTE and define the row number when selecting from it, and ran into the same issue.

with cte_1 as (
    select person_id, date,
    year(date) as year, month(date) as month
    from data
    where 
        (
        person_id, 
        date, 
        extract(year from date), 
        extract(month from date)
        ) not in (
            select person_id, 
            min_date, 
            extract(year from min_date), 
            extract(month from min_date) from first_date
        )
)
select *, row_number() over(partition by person_id, month order by date)
from cte_1

I tried ANOTHER CTE, and it worked correctly.

with cte_1 as (
    select person_id, date,
    year(date) as year, month(date) as month
    from data
    where 
        (
        person_id, 
        date, 
        extract(year from date), 
        extract(month from date)
        ) not in (
            select person_id, 
            min_date, 
            extract(year from min_date), 
            extract(month from min_date) from first_date
        )
)
,cte_2 as (
select *
from cte_1
)

select *, row_number() over(partition by person_id, month order by date)
from cte_2

Does anyone know why this could possibly be happening?


Solution

  • The issue here ended up being pretty simple, I needed to partition over id, year, and month. I was seeing strange results, because for instance January 2018, January 2019, and January 2020 would all be in the same partition. Since I was only filtering for January 2020 I would only see the higher row numbers.

    It is still strange to me that Adding an additional CTE seemingly fixes the issue, but it is likely that it just obscures it rather than fixing anything.