Search code examples
sqlpostgresqlaggregate-functionswindow-functions

Defining relative percentage frame limits in a postgres window frame


I have a query where I am trying to find the average of energy consumption over a group of service locations that each fall within ±5% of each other for a particular month. I know that range and rows function can't define a particular range of numbers and instead use preceding and following clauses but how can I write this query to suit my needs?

WITH energy_consumption_cte AS (
    SELECT
        sl.id,
        sl.sqft,
        SUM(CASE WHEN dl.log_label = 'energy use' THEN dl.value END) AS monthly_consumption
    FROM
        serv_location sl
    INNER JOIN
        device d ON d.service_loc_id = sl.id
    INNER JOIN
        device_log dl ON dl.device_id = d.id
    WHERE
        dl."timestamp" >= '2022-08-01 00:01:00.000'::timestamp
        AND dl."timestamp" < '2022-09-01 00:01:00.000'::timestamp
        AND sl.sqft IS NOT NULL
    GROUP BY
        sl.id, sl.sqft
)

SELECT
    id,
    sqft,
    monthly_consumption,
    (monthly_consumption / AVG(monthly_consumption) OVER w) * 100 
    AS average_monthly_for_sqft_range
FROM
    energy_consumption_cte
window w as (PARTITION BY sqft ORDER BY sqft range between 
    ROUND(sqft * 0.95) and ROUND(sqft * 1.05))
ORDER BY
    sqft ASC;

Solution

  • Window frame range mode can define particular range. To use an aggregate function that'll include values from records 5% up/down from the current one, you should think range between 0.05*sqft preceding and 0.05*sqft following. Problem is, you can't use variables there.

    In the offset PRECEDING and offset FOLLOWING frame options, the offset must be an expression not containing any variables, aggregate functions, or window functions.

    If you try to refer to the sqft column/field, you'll get this:

    ERROR:  argument of RANGE must not contain variables
    LINE 27:              RANGE BETWEEN sqft*0.05 PRECEDING
                                        ^
    

    You can work around that limitation by first calculating % change from row to row using lag() or lead(), then collecting that into a stepping sum(), so that it maintains the initial order of sqft. Once you have that, you can use constant offset in the frame clause to reach 5% back and forth. Demo at db<>fiddle:

    WITH energy_consumption_cte AS (
        SELECT sl.id,
               sl.sqft,
               SUM(dl.value)filter(where dl.log_label='energy use') AS monthly_consumption
        FROM serv_location sl
        INNER JOIN device d ON d.service_loc_id = sl.id
        INNER JOIN device_log dl ON dl.device_id = d.id
        WHERE dl."timestamp" >= '2022-08-01 00:01:00.000'::timestamp
          AND dl."timestamp" < '2022-09-01 00:01:00.000'::timestamp
          AND sl.sqft IS NOT NULL
        GROUP BY sl.id, sl.sqft 
    ),square_footage_steps_cte AS (
        SELECT *, 1-lag(sqft,1,sqft)over(order by sqft)/sqft::numeric AS sqft_step
        FROM energy_consumption_cte
        ORDER BY sqft ASC
    ),square_footage_steps_cumulative_cte AS (
        SELECT *, sum(sqft_step)over(order by sqft) AS sqft_step_sum
        FROM square_footage_steps_cte
        ORDER BY sqft ASC)
    SELECT id,
           sqft,
           monthly_consumption,
           AVG(monthly_consumption) OVER w AS average_monthly_for_sqft_range,
           array_agg(id)over w as rows_in_frame
    FROM square_footage_steps_cumulative_cte 
    WINDOW w as (ORDER BY sqft_step_sum
                 RANGE BETWEEN 0.05 PRECEDING
                       AND     0.05 FOLLOWING)
    ORDER BY sqft ASC;
    
    id sqft monthly_consumption average_monthly_for_sqft_range rows_in_frame
    1 100 400 700.0000000000000000 {1,2}
    2 105 1000 1000.0000000000000000 {1,2,3}
    3 110 1600 1600.0000000000000000 {2,3,4}
    4 115 2200 1900.0000000000000000 {3,4}
    5 200 2800 3100.0000000000000000 {5,6}
    6 210 3400 3100.0000000000000000 {5,6}