Search code examples
sqlsnowflake-cloud-data-platformmode

how to assure a deterministic result of a query that uses mode() in snowflake


I use snowflake and I want to use multiple mode() expressions in one select statement. So it looks like:

SELECT
x,
y,
mode(col1),
mode(col2),
...
mode(col15)
FROM table
GROUP BY x, y

My problem is that it yields nondeterministic output in case of ties. The documentation does not explain exactly how are the ties resolved. It says only:

If there is a tie for most frequent value (two or more values occur as frequently as each other, and more frequently than any other value), MODE returns one of those values.

https://docs.snowflake.net/manuals/sql-reference/functions/mode.html I need some workaround to get the equivalent of mode(), which will always results in a deterministic output. Something like: use mode(), but in case of a tie order by some column and select the first value.

I do not supply example to replicate the nondeterministic result, because it seems to occur only with bigger data sets or complex queries.


Solution

  • so mode seems to prefer the first value it see in a tie breaker.

    with data as (
    select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
        ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
    )
    select x
        ,mode(col1)
        ,mode(col2)
        ,mode(col3)
    from data 
    group by 1
    order by 1;
    

    swapping the first value of the 2/20 or 3/30 pair shows this.

    so build a pattern trying to solve this in one expression:

    with data as (
    select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
        ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
    )
    select x
        ,col1
        ,col2
        ,col3
        ,count(col1)over(partition by x,col1) c_col1
        ,count(col2)over(partition by x,col2) c_col2
        ,count(col3)over(partition by x,col3) c_col3
    from data ;
    

    lends it self to:

    with data as (
    select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
        ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
    )
    select x
        ,col1
        ,col2
        ,col3 
        ,row_number() over (partition by x order by c_col1 desc, col1) as r1
        ,row_number() over (partition by x order by c_col2 desc, col2) as r2
        ,row_number() over (partition by x order by c_col3 desc, col3) as r3
    from (
      select x
          ,col1
          ,col2
          ,col3
          ,count(col1)over(partition by x,col1) c_col1
          ,count(col2)over(partition by x,col2) c_col2
          ,count(col3)over(partition by x,col3) c_col3
      from data 
    )
    order by 1;
    

    with these result though:

    X   COL1    COL2    COL3    R1  R2  R3
    1   1   2   3   2   1   1
    1   2   2   3   3   2   2
    1   1   1   3   1   3   3
    4   1   2   3   2   1   1
    4   2   20  3   4   4   2
    4   2   2   30  3   2   3
    4   1   20  30  1   3   4
    

    you cannot use logic like

    QUALIFY row_number() over (partition by x order by c_col1 desc, col1) = 1
      AND row_number() over (partition by x order by c_col2 desc, col2) = 1
      AND row_number() over (partition by x order by c_col3 desc, col3 desc) = 1
    

    to pick the best, as the best row for each column are not aligned.

    which leads to a CTE (or subquery) for each column, much in the pattern that Gorndon showed.

    with data as (
    select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
        ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
    ),col1_m as (
        select x, col1, count(*) as c 
        from data 
        group by 1,2
        QUALIFY row_number() over (partition by x order by c desc, col1) = 1
    ),col2_m as (
        select x, col2, count(*) as c 
        from data 
        group by 1,2
        QUALIFY row_number() over (partition by x order by c desc, col2) = 1
    ),col3_m as (
        select x, col3, count(*) as c 
        from data 
        group by 1,2
        QUALIFY row_number() over (partition by x order by c desc, col3) = 1
    ), base as (
    select distinct x from data
    )
    select b.x
        ,c1.col1
        ,c2.col2
        ,c3.col3
    from base as b
    left join col1_m as c1 on b.x = c1.x
    left join col2_m as c2 on b.x = c2.x
    left join col3_m as c3 on b.x = c3.x
    order by 1;
    

    which gives the results you are expecting

    X   COL1    COL2    COL3
    1   1   2   3
    4   1   2   3
    

    but you will need to expand X to be the set of things (x,y,..) that you care for etc.