Tags: snowflake-cloud-data-platform, aggregation

Snowflake: Aggregating by a sliding window (the past 60 mins) for a dataset where the sampling frequency is non-uniform


I have data with a non-uniform sampling distribution. I want to aggregate the data on a rolling/sliding basis (over the past 60 minutes).

[screenshot: sample data]

In order to achieve an hourly average (partitioned by city), I used the following code, which worked:

SELECT *,
       AVG(VALUE) OVER (PARTITION BY CITY, DATE_AND_HOUR ORDER BY TIMESTAMP) AS HOURLY_AVG
FROM
(
    SELECT *,
           DATE_TRUNC('HOUR', TIMESTAMP) AS DATE_AND_HOUR
    FROM SAMPLE_DATA
)

However, my desired output is as follows:

[screenshot: desired output, a rolling 60-minute aggregation]

I know Snowflake doesn't support RANGE frames with an offset, and I can't specify a fixed ROWS BETWEEN frame in a window function because my sampling distribution is non-uniform.
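
(For illustration, in an engine that supports interval-offset RANGE frames, e.g. PostgreSQL 11+, the window I'm after could be written directly; Snowflake rejects this syntax:)

-- NOT valid in Snowflake; shown only to illustrate the target semantics
SELECT *,
       AVG(VALUE) OVER (
           PARTITION BY CITY
           ORDER BY TIMESTAMP
           RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
       ) AS P60_AVG
FROM SAMPLE_DATA;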

I read some potential solutions on this page, but they don't work in Snowflake: sum last n days quantity using sql window function. Essentially, it's an analogous problem.


Solution

  • Firstly, what you show as "average" in your example is actually the "sum", and your first "Shanghai" result includes a "Beijing" result.

    You have two options: build a fixed-size window dataset (build partials for each minute) and then use a window frame of fixed size over that, OR self-join and just aggregate those rows (as Felipe has shown).

    If you have very dense data, you might find the former more performant; if you have sparse data, the latter approach should be faster, and it is definitely faster to code.

    So, the simple one first:

    with data(city, timestamp, value) as (
        select column1, try_to_timestamp(column2, 'yyyy/mm/dd hh:mi'), column3 from values 
        ('beijing', '2022/05/25 10:33', 22),
        ('beijing', '2022/05/25 10:37', 20),
        ('beijing', '2022/05/25 11:36', 29),
        ('beijing', '2022/05/26 11:36', 28),
        ('beijing', '2022/05/26 10:00', 21),
        ('shanghai', '2022/05/26 11:00', 33),
        ('shanghai', '2022/05/26 11:46', 35),
        ('shanghai', '2022/05/26 12:40', 37)
    )
    
    select a.*
        ,avg(b.value) as p60_avg
        ,count(b.value)-1 as p60_count
        ,sum(b.value) as p60_sum
    from data as a
    left join data as b
        on a.city = b.city and b.timestamp between dateadd(hour, -1, a.timestamp) and a.timestamp
    group by 1,2,3
    order by 1,2
    

    gives:

    CITY      TIMESTAMP                VALUE  P60_AVG  P60_COUNT  P60_SUM
    beijing   2022-05-25 10:33:00.000     22     22           0       22
    beijing   2022-05-25 10:37:00.000     20     21           1       42
    beijing   2022-05-25 11:36:00.000     29     24.5         1       49
    beijing   2022-05-26 10:00:00.000     21     21           0       21
    beijing   2022-05-26 11:36:00.000     28     28           0       28
    shanghai  2022-05-26 11:00:00.000     33     33           0       33
    shanghai  2022-05-26 11:46:00.000     35     34           1       68
    shanghai  2022-05-26 12:40:00.000     37     36           1       72
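
    Applied to a real table rather than the inline test data, the same sparse pattern looks like this (a sketch, assuming SAMPLE_DATA has the columns CITY, TIMESTAMP, VALUE from the question):

    select a.city
        ,a.timestamp
        ,a.value
        ,avg(b.value) as p60_avg
        ,count(b.value)-1 as p60_count  -- the -1 excludes the current row from the count
        ,sum(b.value) as p60_sum
    from SAMPLE_DATA as a
    left join SAMPLE_DATA as b
        on a.city = b.city and b.timestamp between dateadd(hour, -1, a.timestamp) and a.timestamp
    group by 1,2,3
    order by 1,2;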

    The dense version:

    with data(city, timestamp, value) as (
        select column1, try_to_timestamp(column2, 'yyyy/mm/dd hh:mi'), column3 from values 
        ('beijing', '2022/05/25 10:33', 22),
        ('beijing', '2022/05/25 10:37', 20),
        ('beijing', '2022/05/25 11:36', 29),
        ('beijing', '2022/05/26 11:36', 28),
        ('beijing', '2022/05/26 10:00', 21),
        ('shanghai', '2022/05/26 11:00', 33),
        ('shanghai', '2022/05/26 11:46', 35),
        ('shanghai', '2022/05/26 12:40', 37)
    ), filled_time as (
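        -- one row per minute per city, from each city's first sample to its last;
        -- generator ROWCOUNT of 10000 covers ~7 days of minutes, enlarge for longer spans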
        select city,
            dateadd(minute, row_number() over(partition by city order by null)-1, min_t) as timestamp
        from (
            select 
                city, min(timestamp) as min_t, max(timestamp) as max_t
            from data
            group by 1
        ), table(generator(ROWCOUNT => 10000))
        qualify timestamp <= max_t
    )
    select
        ft.city
        ,ft.timestamp
        -- partition by city so frames don't bleed across cities;
        -- avg skips the NULLs on the filler minutes, so only real samples contribute
        ,avg(d.value) over (partition by ft.city order by ft.timestamp rows between 60 preceding and current row) as p60_avg
    from filled_time as ft
    left join data as d
        on ft.city = d.city and ft.timestamp = d.timestamp
    order by 1,2;
    

    gives:

    CITY     TIMESTAMP                P60_AVG
    beijing  2022-05-25 10:33:00.000  22
    beijing  2022-05-25 10:34:00.000  22
    beijing  2022-05-25 10:35:00.000  22
    beijing  2022-05-25 10:36:00.000  22
    beijing  2022-05-25 10:37:00.000  21
    beijing  2022-05-25 10:38:00.000  21
    beijing  2022-05-25 10:39:00.000  21
    beijing  2022-05-25 10:40:00.000  21
    beijing  2022-05-25 10:41:00.000  21
    beijing  2022-05-25 10:42:00.000  21
    beijing  2022-05-25 10:43:00.000  21
    beijing  2022-05-25 10:44:00.000  21
    beijing  2022-05-25 10:45:00.000  21
    beijing  2022-05-25 10:46:00.000  21
    snip...

    And those "extra" rows can be dropped with a QUALIFY:

    select
        ft.city
        ,ft.timestamp
        ,avg(d.value) over (partition by ft.city order by ft.timestamp rows between 60 preceding and current row) as p60_avg
    from filled_time as ft
    left join data as d
        on ft.city = d.city and ft.timestamp = d.timestamp
    qualify d.value is not null  -- keep only minutes that had a real sample
    order by 1,2;
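
    And if you want the count and sum from the first approach as well, the same frame can be reused; a sketch (count(d.value) and sum(d.value) skip the NULLs on the filled minutes, so only real samples contribute):

    select
        ft.city
        ,ft.timestamp
        ,avg(d.value) over (partition by ft.city order by ft.timestamp rows between 60 preceding and current row) as p60_avg
        ,count(d.value) over (partition by ft.city order by ft.timestamp rows between 60 preceding and current row)-1 as p60_count
        ,sum(d.value) over (partition by ft.city order by ft.timestamp rows between 60 preceding and current row) as p60_sum
    from filled_time as ft
    left join data as d
        on ft.city = d.city and ft.timestamp = d.timestamp
    qualify d.value is not null
    order by 1,2;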