Search code examples
sqlsortinggroup-byaggregatesql-order-by

SQL group by for aggregate with sorted input?


This is a question about SQL (the standard) as much as about PostgreSQL.

If we have an aggregate function that depends on the order of applying it, such as, for example, some concatenate aggregator (in PgSQL there is string_agg, or array_agg) and in Oracle there is list_agg or something, then the order that the rows are fed into the aggregation matters.

It would seem that the SQL standard for that should have some sort of GROUP BY ... ORDERED BY ... clause as fictitiously written like this:

SELECT groupKey,
       string_agg(col, '; ') AS col
  FROM Tab
  GROUP BY groupKey ORDERING BY sortKey

which ORDERING BY is different from the normal ORDER BY clause which sorts the result of the group aggregate operation. But that doesn't exist (and ChatGPT also has no idea how to do this directly.

We can use the window mode of calling the aggregator with

WITH ExcessiveAggregateWorkProduct AS (
  SELECT groupKey,
         row_number() OVER w AS rownum,
         string_agg(col, '; ') OVER w AS col
    FROM Tab
    WINDOW w AS (PARTITION BY groupKey ORDER BY sortKey
                   ROWS BETWEEN 0 PRECEDING AND UNBOUNDED FOLLOWING)
) SELECT * FROM ExcessiveAggregateWorkProduct 
   WHERE rownum = 1

But this is expensive because that window aggregate needs to be computed for every row only to throw away all but the first.

So I often just do this:

WITH SortedTab AS (
  SELECT groupKey, col
    FROM Tab
    ORDER BY groupKey, sortKey
) SELECT groupKey,
         string_agg(col, '; ') AS col
    FROM SortedTab
    GROUP BY groupKey;

and usually the query plan tells me that it should work. But there is no guarantee.

I would like to know how it can be guaranteed or how that excessive aggregation could be limited up front or how the RDBMS engine is supposed to optimize that away?


Solution

  • The answer is

    SELECT groupKey,
           string_agg(col, '; ' ORDER BY sortKey) AS col
      FROM Tab
      GROUP BY groupKey 
    

    Even ChatGPT didn't know that ;)

    The interesting thing is, in PostgreSQL where CREATE AGGREGATE is totally easy, you do not need to specify anything special to allow such order by to work. For example:

    CREATE AGGREGATE tsvector_agg(
      sfunc = tsvector_concat, 
      basetype = tsvector, 
      stype = tsvector
    );
    

    and immediately you can use tsvector_agg(... ORDER BY sortKey)