Search code examples
sqlpostgresqlamazon-redshift

Redshift GROUP BY ALL syntax not supported


Moving to Redshift from another provider, I see that GROUP BY ALL (which groups by all non-aggregate columns in the SELECT statement) syntax is not supported. This was super useful for me when I had loads of non-aggregate columns and didn't want to manually type out GROUP BY 1,2,3,4... etc.

I was wondering if there is in fact a similar command on Redshift or is it simply a limitation of PostgreSQL?


Solution

  • There is a group by all in both Redshift and Postgres but that's not what it does. You might be interested in group by cube, rollup or grouping sets. Postgres has them, Redshift has them too.

    Unfortunately, there's no syntax for grouping by everything except aggregate expressions detected in the select list. There kind of is, but that's not entirely equivalent, and not as handy - see example at the end.

    The group by all is default behaviour in both PostgreSQL and Redshift, the alternative being group by distinct which removes duplicate grouping sets. Quoting the doc:

    When specifying multiple grouping items together, the final set of grouping sets might contain duplicates. For example:

    GROUP BY ROLLUP (a, b), ROLLUP (a, c)
    

    is equivalent to

    GROUP BY GROUPING SETS (
       (a, b, c),
       (a, b),
       (a, b),
       (a, c),
       (a),
       (a),
       (a, c),
       (a),
       ()
    )
    

    If these duplicates are undesirable, they can be removed using the DISTINCT clause directly on the GROUP BY. Therefore:

    GROUP BY DISTINCT ROLLUP (a, b), ROLLUP (a, c)
    

    is equivalent to

    GROUP BY GROUPING SETS (
       (a, b, c),
       (a, b),
       (a, c),
       (a),
       ()
    )
    

    It's similar to how select is actually select all unless you make it select distinct. Again, the doc:

    If SELECT DISTINCT is specified, all duplicate rows are removed from the result set (one row is kept from each group of duplicates). SELECT ALL specifies the opposite: all rows are kept; that is the default.


    If there's a join, you can use a group by tablename.* to make all columns of the table your grouping set but

    1. You still need to list them all in the join conditions. The exception is if all their names match and you're willing to risk a natural join.
    2. You can't display them next to the aggregate expressions because they form a single grouping set. You can use grouping() function, but it'll just show that all rows belong to the 0th grouping set.

    demo1 at db-fiddle

    select sum(test_vals.v) 
    from test
    natural join test_vals
    group by all test.*;
    
    sum
    774.07
    2114.39
    806.75
    966.23
    1644.51

    The scenario I think @Lajos Arpad is referring to is this:
    demo2 at db-fiddle

    select test.*,sum(v)
    from test
    join test_vals using(a)
    group by test.a;
    
    Error: column "test.b" must appear in the GROUP BY clause or be used in an aggregate function
    

    Suggested addition of b to the group by list makes that go away only to be replaced by a request to do something about c and all following columns. That's no longer the case if a is a primary key - you can group by only the PK, skipping the rest:

    alter table test add primary key(a);
    
    select test.*,sum(v)
    from test
    join test_vals using(a)
    group by test.a;
    
    a b c sum
    1 7 3 4.42
    2 8 1 3.50
    3 1 2 5.18

    You can find this, once more, in the doc:

    If the products table is set up so that, say, product_id is the primary key, then it would be enough to group by product_id in the above example, since name and price would be functionally dependent on the product ID, and so there would be no ambiguity about which name and price value to return for each product ID group.