Search code examples
sqldatetimesubqueryaggregate-functionspresto

How to compute window function for each nth row in Presto?


I am working with a table that contains timeseries data, with a row for each minute for each user.

I want to compute some aggregate functions on a rolling window of N calendar days.

This is achieved via

SELECT
SOME_AGGREGATE_FUN(col) OVER (
        PARTITION BY user_id
        ORDER BY timestamp
        ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
) as my_col
FROM my_table

However, I am only interested in the result of this at a daily scale.

i.e. I want the window to be computed only at 00:00:00, but I want the window itself to contain all the minute-by-minute data to be passed into my aggregate function.

Right now I am doing this:

WITH agg_results AS (
    SELECT
    SOME_AGGREGATE_FUN(col) OVER (
        PARTITION BY user_id
        ORDER BY timestamp_col
        ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
    )
    FROM my_table
)

SELECT * FROM agg_results
WHERE 
timestamp_col = DATE_TRUNC('day', "timestamp_col")

This works in theory, but it does 60 * 24 more computations that necessary, resulting in the query being super slow.

Essentially, I am trying to find a way to make the right window bound skip rows based on a condition. Or, if it is simpler to implement, for every nth row (as I have a constant number of rows for each day).


Solution

  • I don't think that's possible with window functions. You could switch to a subquery instead, assuming that your aggregate function works as a regular aggregate function too (that is, without an OVER() clause):

    select 
        timestamp_col,
        (
            select some_aggregate_fun(t1.col)
            from my_table t1
            where 
                t1.user_id = t.user_id
                and t1.timestamp_col >= t.timestamp_col - interval '1' day 
                and t1.timestamp_col <= t.timestamp_col
        )
    from my_table t
    where timestamp_col = date_trunc('day', timestamp_col)
    

    I am unsure that this would perform better than your original query though; you might need to assess that against your actual dataset.

    You can change interval '1' day to the actual interval you want to use.