Search code examples
postgresqlaggregategreatest-n-per-groupwindow-functionsgaps-and-islands

How to include the first row from the next group in an aggregation?


I have a table in Postgres with categories and values. I want to perform an aggregation, e.g. avg(value) per category but including the first row of the next category in the aggregation.

Sample table:

id  category  value
-------------------
1   1         5.4
2   1         2.1
3   2         1.0
4   2         2.6
5   2         0.3
6   3         4.4
7   3         3.8

id is a primary key and provides an order. Categories are grouped within the order and consecutive.
It would be acceptable (but not necessary) to create an intermediate table like this, which is duplicating the adjacent row:

id  category  value
-------------------
1   1         5.4
2   1         2.1
3   1         1.0  <-- new row
4   2         1.0
5   2         2.6
6   2         0.3
7   2         4.4  <-- new row
8   3         4.4
9   3         3.8

... and then do:

select category, avg(value) group by category from sample_table

How can this be achieved with SQL statements?

I suspect this could be done with window functions and some sophisticated frame clause like GROUPS, but I don't know how. ( See https://www.postgresql.org/docs/12/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS )


Solution

  • You confirmed that category numbers are increasing by 1 steadily, no gaps.
    Here is a simple approach for the simple case:

    SELECT category, avg(value)
    FROM  (
       SELECT category, value
       FROM   tbl
    
       UNION ALL
       (  -- parentheses required
       SELECT DISTINCT ON (category)
              category - 1, value
       FROM   tbl
       WHERE  category > (SELECT min(category) FROM tbl)  -- eliminate corner case
       ORDER  BY category, id
       )   
       ) sub
    GROUP  BY 1
    ORDER  BY 1;
    

    The second term after UNION ALL adds the row like you suggested: I take the first row of each group and simply subtract 1 from the category.

    Corner case: adds a new category with min(category) - 1. Can be eliminated easily ...


    Generic solution for any kind of categories (as long as the order is defined):

    SELECT category, avg(value)
    FROM  (
       SELECT category, value
       FROM   tbl
    
       UNION ALL
       SELECT lag(category) OVER (ORDER BY category), value
       FROM  (
          SELECT DISTINCT ON (category)
                 category, value
          FROM   tbl
          ORDER  BY category, id
          ) unicat
       ) sub
    WHERE  category IS NOT NULL  -- eliminate corner case
    GROUP  BY 1
    ORDER  BY 1;
    

    The first value of each group is added to to the previous category using the window function lag().

    About DISTINCT ON: