Search code examples
sqlsnowflake-cloud-data-platformwindow-functions

Understanding window function better (Snowflake)


I am trying to understand window functions in Snowflake better. I was pretty fine handling simple ones but I stumbled over something that I don't get.

In below code, the commented out line causes an "is not a valid group by expression" error while the one above with double avg works fine and even returns the right value. I don't understand why the double AVG is required to make this work.

SELECT 
    dex.seller,
    date_trunc('month',dex.CREATED_AT_DATE) AS date_base,
    AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)),
    AVG(AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate))) OVER (PARTITION BY dex.seller,date_base)--,
    --AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)) OVER (PARTITION BY dex.seller,date_base)
FROM 
    REP_DATA_SOURCES.DEAL_EXTENDED dex
GROUP BY 1,2

This is a simplification of the actual code.


Solution

  • It is all about logical order of execution.

    Qualify In the execution order of a query, QUALIFY is therefore evaluated after window functions are computed. Typically, a SELECT statement’s clauses are evaluated in the order shown below:

    • From
    • Where
    • Group by
    • Having
    • Window
    • QUALIFY
    • Distinct
    • Order by
    • Limit

    Window/analytical functions are evaluated after GROUP BY.

    Sceanrio 1: Correct

    SELECT 
        dex.seller,
        date_trunc('month',dex.CREATED_AT_DATE) AS date_base,
        AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)),
        AVG(AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate))) 
        OVER (PARTITION BY dex.seller,date_base)--,
    FROM     REP_DATA_SOURCES.DEAL_EXTENDED dex
    GROUP BY 1,2;
    

    Scenario 2: Incorrect (analytical average over value that is not part of group by or wrapped with aggregate function)

    SELECT 
        dex.seller,
        date_trunc('month',dex.CREATED_AT_DATE) AS date_base,
        AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)),
        AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)) OVER (PARTITION BY dex.seller,date_base)
    FROM  REP_DATA_SOURCES.DEAL_EXTENDED dex
    GROUP BY 1,2;
    

    If we skip the analitical part of query, it is incorrect itself at the GROUP BY level:

    SELECT 
        dex.seller,
        date_trunc('month',dex.CREATED_AT_DATE) AS date_base,
        AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)),
        DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)
    FROM  REP_DATA_SOURCES.DEAL_EXTENDED dex
    GROUP BY 1,2;
    

    Expression DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate) is not:

    • part of GROUP BY
    • is not aggregated value(no agg function around it)