Search code examples
snowflake-cloud-data-platformanalytic-functions

Summing over a numeric column with moving window of varying size in Snowflake


I have a sample dataset given as follows;

time  |  time_diff  | amount
time1 |  time1-time2 | 1000
time2 |  time2-time3 | 2000
time3 |  time3-time4 | 3000
time4 |  time4-time5 | 4500
time5 |   NULL       | 1000

Quick explanation; first column gives time of transaction, second column gives difference with next row to get transaction interval(in hours), and third column gives money made in a particular transaction. We have sorted the data in ascending order using time column.

Some values are given as;

time  |  time_diff  | amount
time1 |  2.         | 1000
time2 |  3.         | 2000
time3 |  1.         | 3000
time4 |  19.        | 4500
time5 |   NULL       | 1000

The goal is to find the total transaction for a given time, which occurred within 24 hours of that transaction. For example, the output for time1 shd be; 1000+2000+3000=6000. Because if we add the value at time4, the total time interval becomes 25, hence we omit the value of 4500 from the sum.

Example output:

time  |  amount
time1 | 6000       
time2 | 9500         
time3 | 7500         
time4 | 4500        
time5 | 1000

The concept of Mong window sum should work, in my knowledge, but here the width of the window is variable. Thats the challenge I am facing.Can I kindly get some help here?


Solution

  • You could ignore the time_diff column and use a theta self-join based on a timestamp range, like this:

    WITH srctab AS (    SELECT TO_TIMESTAMP_NTZ('2020-04-15 00:00:00') AS "time", 1000::INT AS "amount"
              UNION ALL SELECT TO_TIMESTAMP_NTZ('2020-04-15 00:02:00'), 2000::INT
              UNION ALL SELECT TO_TIMESTAMP_NTZ('2020-04-15 00:05:00'), 3000::INT
              UNION ALL SELECT TO_TIMESTAMP_NTZ('2020-04-15 00:06:00'), 4500::INT
              UNION ALL SELECT TO_TIMESTAMP_NTZ('2020-04-16 00:01:00'), 1000::INT
              )
    SELECT t1."time", SUM(t2."amount") AS tot
    FROM srctab t1
    JOIN srctab t2 ON t2."time" BETWEEN t1."time" AND TIMESTAMPADD(HOUR, +24, t1."time")
    GROUP BY t1."time"
    ORDER BY t1."time";
    

    Minor detail: if your second column gives the time difference with the next row then I'd say the first value should be 10500 (not 6000) because it's only your 5th transaction of 1000 which is more than 24 hours ahead... I'm guessing your actual timestamps are at 0, 2, 5, 6 and 25 hours?