Search code examples
sqlpostgresqlaggregate-functionswindow-functionssql-limit

PostgreSQL window function with LIMIT


The query below groups the results from first into 4 equally spaced date bins and aggregates an average for the_value in each bin.

WITH first as(
SELECT
    extract(EPOCH FROM foo.t_date) as the_date,
    foo_val as the_value
FROM bar
INNER JOIN foo
ON
    foo.user_id = bar.x_id
    and
    foo.user_name = 'xxxx'
)
SELECT bin, round(sum(bin_sum) OVER w /sum(bin_ct) OVER w, 2) AS running_avg
FROM  (
   SELECT width_bucket(first.the_date
                     , x.min_epoch, x.max_epoch, x.bins) AS bin
        , sum(first.the_value) AS bin_sum
        , count(*)   AS bin_ct
   FROM   first
       , (SELECT MIN(first.the_date) AS min_epoch
               , MAX(first.the_date) AS max_epoch
               , 4 AS bins
          FROM  first
         ) x
   GROUP  BY 1
   ) sub
WINDOW w AS (ORDER BY bin)
ORDER  BY 1;

I would like to be able to only calculate the average for the lowest say 20 the_value's in each bin. From other posts here on Stackoverflow I have seen that this is possible and that perhaps ORDER BY the_value and rank() is the best way to go about it. But my struggle is that I'm not sure where my current query should be modified to implement this.

Any insight would be appreciated.

Postgres version 9.3


Solution

  • Use row_number() on each bin.
    First compute the row number rn, then apply WHERE rn < 21 in the next step:

    WITH first AS (
       SELECT extract(EPOCH FROM foo.t_date) AS the_date
            , foo_val AS the_value
       FROM bar
       JOIN foo ON foo.user_id = bar.x_id
               AND foo.user_name = 'xxxx'
       )
    , x AS (
       SELECT MIN(the_date) AS min_epoch
            , MAX(the_date) AS max_epoch
       FROM  first
       )
    , y AS (
       SELECT width_bucket(f.the_date, x.min_epoch, x.max_epoch, 4) AS bin, *
       FROM   first f, x
       )
    , z AS (
       SELECT row_number() OVER (PARTITION BY bin ORDER BY the_value) AS rn, *
       FROM   y
       )
    SELECT bin, round(sum(bin_sum) OVER w / sum(bin_ct) OVER w, 2) AS running_avg
    FROM  (
       SELECT bin
            , sum(the_value) AS bin_sum
            , count(*)       AS bin_ct
       FROM   z
       WHERE  rn < 21   -- max 20 lowest values
       GROUP  BY 1
       ) sub
    WINDOW w AS (ORDER BY bin)
    ORDER  BY 1;
    

    CTEs y and z could be conflated. Similarly first and x could be conflated.
    But it's clearer that way.

    Untested, since we don't have test data.