Search code examples
sqlsql-serverpostgresqlaggregate-functions

Understanding multiple GROUPING SETS


I am clear on how GROUPING SETS (and ROLLUP and CUBE) work for a single expression. However, I've never entirely understood how it works when multiple GROUPING SETS are combined. Here is an example table that I've created to help with this question:

CREATE TABLE movies AS (
    SELECT 'Black Widow' Movie, 'Disney' Studio, 2021 AS Year, 226583885 Revenue UNION ALL
    SELECT 'Black Widow', 'Disney', 2022, 126583885 UNION ALL
    SELECT 'Black Widow', 'Disney', 2023, 26583885 UNION ALL
    SELECT 'Spider-man: No Way Home', 'Sony', 2021, 740615703 UNION ALL
    SELECT 'Spider-man: No Way Home', 'Sony', 2022, 640615703 UNION ALL
    SELECT 'Spider-man: No Way Home', 'Sony', 2023, 540615703 UNION ALL
    SELECT 'Top Gun: Maverick', 'Paramount', 2022, 847848146 UNION ALL
    SELECT 'Top Gun: Maverick', 'Paramount', 2023, 647848146 UNION ALL
    SELECT 'The Batman', 'Warner Bros.', 2022, 486122791 UNION ALL
    SELECT 'The Batman', 'Warner Bros.', 2023, 286122791 UNION ALL
    SELECT 'Barbie', 'Warner Bros.', 2023, 1441769400 UNION ALL
    SELECT 'Oppenheimer', 'NBCUniversal', 2023, 950205530
)

If I repeat a GROUP BY element that is not part of a GROUPING SETS|ROLLUP|CUBE, it seems to have no effect, for example:

select row_number() over () num, studio, sum(revenue) from movies group by studio;
┌──────────────────────┬──────────────┬──────────────┐
│ num                  ┆ Studio       ┆ sum(revenue) │
╞══════════════════════╪══════════════╪══════════════╡
│                    1 ┆ Disney       ┆    379751655 │
│                    2 ┆ Sony         ┆   1921847109 │
│                    3 ┆ Paramount    ┆   1495696292 │
│                    4 ┆ Warner Bros. ┆   2214014982 │
│                    5 ┆ NBCUniversal ┆    950205530 │
└──────────────────────┴──────────────┴──────────────┘

Is the same as:

select row_number() over () num, studio, sum(revenue) from movies group by studio, studio, studio;
┌──────────────────────┬──────────────┬──────────────┐
│ num                  ┆ Studio       ┆ sum(revenue) │
╞══════════════════════╪══════════════╪══════════════╡
│                    1 ┆ Disney       ┆    379751655 │
│                    2 ┆ Sony         ┆   1921847109 │
│                    3 ┆ Paramount    ┆   1495696292 │
│                    4 ┆ Warner Bros. ┆   2214014982 │
│                    5 ┆ NBCUniversal ┆    950205530 │
└──────────────────────┴──────────────┴──────────────┘

That seems straightforward. However, when I add in a GROUPING SETS, it does change things when elements are repeated:

select row_number() over () num, studio, sum(revenue) from movies 
group by grouping sets(studio, ());
┌─────┬──────────────┬──────────────┐
│ num ┆ Studio       ┆ sum(revenue) │
╞═════╪══════════════╪══════════════╡
│   1 ┆ Disney       ┆    379751655 │
│   2 ┆ Sony         ┆   1921847109 │
│   3 ┆ Paramount    ┆   1495696292 │
│   4 ┆ Warner Bros. ┆   2214014982 │
│   5 ┆ NBCUniversal ┆    950205530 │
│   6 ┆              ┆   6961515568 │
└─────┴──────────────┴──────────────┘
Elapsed: 3 ms

select row_number() over () num, studio, sum(revenue) from movies 
group by grouping sets(studio, ()), grouping sets(studio, ());
┌─────┬──────────────┬──────────────┐
│ num ┆ Studio       ┆ sum(revenue) │
╞═════╪══════════════╪══════════════╡
│   1 ┆ Disney       ┆    379751655 │
│   2 ┆ Sony         ┆   1921847109 │
│   3 ┆ Paramount    ┆   1495696292 │
│   4 ┆ Warner Bros. ┆   2214014982 │
│   5 ┆ NBCUniversal ┆    950205530 │
│   6 ┆ Disney       ┆    379751655 │
│   7 ┆ Sony         ┆   1921847109 │
│   8 ┆ Paramount    ┆   1495696292 │
│   9 ┆ Warner Bros. ┆   2214014982 │
│  10 ┆ NBCUniversal ┆    950205530 │
│  11 ┆ Disney       ┆    379751655 │
│  12 ┆ Sony         ┆   1921847109 │
│  13 ┆ Paramount    ┆   1495696292 │
│  14 ┆ Warner Bros. ┆   2214014982 │
│  15 ┆ NBCUniversal ┆    950205530 │
│  16 ┆              ┆   6961515568 │
└─────┴──────────────┴──────────────┘
Elapsed: 2 ms

How then is the GROUPING SETS for multiple items done? For example, if the equivalent UNION-ed clause for GROUP BY GROUPING SETS(studio, ()) is:

                                                                        -- GROUPING SETS(
SELECT studio, SUM(revenue) FROM movies GROUP BY studio UNION ALL       --   studio,
SELECT NULL  , SUM(revenue) FROM movies                                 --   ()
                                                                        -- )

Then what would be the equivalent UNION ALL if there are multiple items in the GROUP BY ?


Note: the above queries have been tested in both Postgres -- https://www.db-fiddle.com/f/hvrNtMSrAe7UMz6dTn33Y9/0 -- and SQL Server (though I was getting arithmetic overflow so I modified the numbers a bit) https://dbfiddle.uk/LW3fXIIo.


Solution

  • From the documentation:

    If multiple grouping items are specified in a single GROUP BY clause, then the final list of grouping sets is the cross product of the individual items.

    So

    group by grouping sets(studio, ()), grouping sets(studio, ())
    

    is equivalent to

    group by grouping sets (
      (studio), -- (studio x studio, simplifies to just studio)
      (studio), -- (studio x ())
      (studio), -- (() x studio)
      (),       -- (() x ())
    )
    

    Note that you can use

    group by distinct grouping sets(...)
    

    in case you want to remove duplicate grouping sets (possibly useful when dealing with complex rules and/or having multiple rollup/cube in a single group by).