Search code examples
sqlsnowflake-cloud-data-platformwindow-functions

"Year-on-year" with window functions


I'm doing year on year analysis by doing joins. I'm joining the same table for every year but since I'm using another tool to build my SQL its not so 'dynamic'. If I could solve this with window functions it would be better. So any suggestion is appreciated :D

The idea is to do this by hour. That is, I want to compare sales lets say from 2022-04-05 hour 8 with sales from 2021-04-05 hour 8 and from 2020-04-05 hour 8.

My data is aggregated by hour:

Store Timestamp Sales
1 2019-04-05T08:00:00Z 10000
1 2020-04-05T08:00:00Z 12000
1 2021-04-05T08:00:00Z 15000
1 2022-04-05T08:00:00Z 20000
2 2019-04-05T08:00:00Z 13000
2 2020-04-05T08:00:00Z 16000
2 2021-04-05T08:00:00Z 19000
2 2022-04-05T08:00:00Z 22000

Desired result (order may begin with this year) timestamps are not needed. I added them just to clarify:

Store Timestamp_1 Sales_1 Timestamp_2 Sales_2 Timestamp_3 Sales_3
1 2019-04-05T08:00:00Z 10000 2020-04-05T08:00:00Z 12000 2021-04-05T08:00:00Z 15000
2 2019-04-05T08:00:00Z 13000 2020-04-05T08:00:00Z 16000 2021-04-05T08:00:00Z 19000

Any ideas? Thanks in Advance


Solution

  • Not really your answer but if you only have one days worth of hours

    SELECT 
        store
        ,hour(date)
        ,array_agg(object_construct(date::text, sales)) within group (order by date) as hour_history
    FROM data_table
    GROUP BY 1,2
    ORDER BY 1,2;
    

    gives:

    STORE HOUR(DATE) HOUR_HISTORY
    1 8 [ { "2019-04-05 08:00:00.000": 10000 }, { "2020-04-05 08:00:00.000": 12000 }, { "2021-04-05 08:00:00.000": 15000 }, { "2022-04-05 08:00:00.000": 20000 } ]
    2 8 [ { "2019-04-05 08:00:00.000": 13000 }, { "2020-04-05 08:00:00.000": 16000 }, { "2021-04-05 08:00:00.000": 19000 }, { "2022-04-05 08:00:00.000": 22000 } ]

    Thus:

    SELECT store
        ,hour_history[0].date::timestamp as Timestamp_1
        ,hour_history[0].sales::number as Sales_1
        ,hour_history[1].date::timestamp as Timestamp_2
        ,hour_history[1].sales::number as Sales_2
        ,hour_history[2].date::timestamp as Timestamp_3
        ,hour_history[2].sales::number as Sales_3
    FROM (
    SELECT 
        store
        ,hour(date)
        ,array_agg(object_construct('date', date::text, 'sales', sales)) within group (order by date) as hour_history
    FROM data_table
    GROUP BY 1,2
    )
    ORDER BY 1;
    

    does give:

    STORE TIMESTAMP_1 SALES_1 TIMESTAMP_2 SALES_2 TIMESTAMP_3 SALES_3
    1 2019-04-05 08:00:00.000 10,000 2020-04-05 08:00:00.000 12,000 2021-04-05 08:00:00.000 15,000
    2 2019-04-05 08:00:00.000 13,000 2020-04-05 08:00:00.000 16,000 2021-04-05 08:00:00.000 19,000

    If you have many months, day, hours worth of data this works for the inner loop:

    SELECT 
        store
        ,month(date)
        ,day(date)
        ,hour(date)
        ,array_agg(object_construct('date', date::text, 'sales', sales)) within group (order by date) as hour_history
    FROM data_table
    GROUP BY 1,2,3,4
    

    AKA:

    WITH data_table AS (
        SELECT * FROM VALUES
            (1,'2019-04-05T08:00:00Z'::timestamp,10000),
            (1,'2020-04-05T08:00:00Z'::timestamp,12000),
            (1,'2021-04-05T08:00:00Z'::timestamp,15000),
            (1,'2022-04-05T08:00:00Z'::timestamp,20000),
    
            (1,'2019-03-05T08:00:00Z'::timestamp,10001),
            (1,'2020-03-05T08:00:00Z'::timestamp,12001),
            (1,'2021-03-05T08:00:00Z'::timestamp,15001),
            (1,'2022-03-05T08:00:00Z'::timestamp,20001),
    
            (1,'2019-04-04T08:00:00Z'::timestamp,10002),
            (1,'2020-04-04T08:00:00Z'::timestamp,12002),
            (1,'2021-04-04T08:00:00Z'::timestamp,15002),
            (1,'2022-04-04T08:00:00Z'::timestamp,20002),
    
        
            (2,'2019-04-05T08:00:00Z'::timestamp,13000),
            (2,'2020-04-05T08:00:00Z'::timestamp,16000),
            (2,'2021-04-05T08:00:00Z'::timestamp,19000),
            (2,'2022-04-05T08:00:00Z'::timestamp,22000)
        t(store, date, sales)
    )
    SELECT store
        ,hour_history[0].date::timestamp as Timestamp_1
        ,hour_history[0].sales::number as Sales_1
        ,hour_history[1].date::timestamp as Timestamp_2
        ,hour_history[1].sales::number as Sales_2
        ,hour_history[2].date::timestamp as Timestamp_3
        ,hour_history[2].sales::number as Sales_3
    FROM (
    SELECT 
        store
        ,month(date)
        ,day(date)
        ,hour(date)
        ,array_agg(object_construct('date', date::text, 'sales', sales)) within group (order by date) as hour_history
    FROM data_table
    GROUP BY 1,2,3,4
    )
    ORDER BY 1;
    

    gives:

    STORE TIMESTAMP_1 SALES_1 TIMESTAMP_2 SALES_2 TIMESTAMP_3 SALES_3
    1 2019-04-05 08:00:00.000 10,000 2020-04-05 08:00:00.000 12,000 2021-04-05 08:00:00.000 15,000
    1 2019-03-05 08:00:00.000 10,001 2020-03-05 08:00:00.000 12,001 2021-03-05 08:00:00.000 15,001
    1 2019-04-04 08:00:00.000 10,002 2020-04-04 08:00:00.000 12,002 2021-04-04 08:00:00.000 15,002
    2 2019-04-05 08:00:00.000 13,000 2020-04-05 08:00:00.000 16,000 2021-04-05 08:00:00.000 19,000

    What you will note is you example has 4 years of data, and you are throwing away to 2022 data.