Search code examples
sqlsql-serverjoinaggregate-functionsgrouping-sets

Is it possible to use grouping sets to achieve this type of aggregation using multiple joins


I am able to get the correct results using UNION ALL - but have been trying to get this to work using GROUPING SETS without success. Is this possible?

Example SQL:

--Build data table
WITH TABLE_1 AS ( --data
SELECT 1 T1_ID, 2 VAL UNION
SELECT 2 T1_ID, 4 VAL UNION
SELECT 3 T1_ID, 6 VAL UNION
SELECT 4 T1_ID, 8 VAL UNION
SELECT 5 T1_ID, 10 VAL
),
TABLE_2 AS ( --first level join
SELECT 1 T1_ID, 'AA' T2_ID UNION
SELECT 2 T1_ID, 'AA' T2_ID UNION
SELECT 3 T1_ID, 'BB' T2_ID UNION
SELECT 4 T1_ID, 'BB' T2_ID UNION
SELECT 5 T1_ID, 'BB' T2_ID
),
TABLE_3 AS ( --second level join
SELECT 1 T1_ID, 'CCC' T3_ID UNION
SELECT 2 T1_ID, 'CCC' T3_ID UNION
SELECT 3 T1_ID, 'CCC' T3_ID UNION
SELECT 4 T1_ID, 'CCC' T3_ID UNION
SELECT 5 T1_ID, 'CCC' T3_ID UNION
SELECT 1 T1_ID, 'DDD' T3_ID UNION
SELECT 2 T1_ID, 'DDD' T3_ID UNION
SELECT 3 T1_ID, 'DDD' T3_ID UNION
SELECT 4 T1_ID, 'DDD' T3_ID UNION
SELECT 5 T1_ID, 'DDD' T3_ID 
)

This is what the data table looks like fully joined -


SELECT 
TABLE_1.T1_ID, 
TABLE_1.VAL, 
TABLE_2.T2_ID,
TABLE_3.T3_ID
FROM TABLE_1
    LEFT JOIN TABLE_2 ON TABLE_2.T1_ID = TABLE_1.T1_ID
    LEFT JOIN TABLE_3 ON TABLE_3.T1_ID = TABLE_1.T1_ID

T1_ID   VAL  T2_ID  T3_ID
1       2    AA     CCC
1       2    AA     DDD
2       4    AA     CCC
2       4    AA     DDD
3       6    BB     CCC
3       6    BB     DDD
4       8    BB     CCC
4       8    BB     DDD
5       10   BB     CCC
5       10   BB     DDD

--This gives me the results I need (Sum VAL by distinct AA, BB, AA+BB, nulls) -

SELECT 
    T2_ID,
    NULL T3_ID,
    SUM(VAL) TOTAL
FROM TABLE_1
    LEFT JOIN TABLE_2 ON TABLE_2.T1_ID = TABLE_1.T1_ID
GROUP BY T2_ID
UNION ALL
SELECT 
    NULL T2_ID, 
    T3_ID,
    SUM(VAL) TOTAL
FROM TABLE_1
    LEFT JOIN TABLE_3 ON TABLE_3.T1_ID = TABLE_1.T1_ID
GROUP BY T3_ID
UNION ALL
SELECT 
    T2_ID,
    T3_ID,
    SUM(VAL) TOTAL
FROM TABLE_1
    LEFT JOIN TABLE_2 ON TABLE_2.T1_ID = TABLE_1.T1_ID
    LEFT JOIN TABLE_3 ON TABLE_3.T1_ID = TABLE_1.T1_ID
GROUP BY T2_ID, T3_ID
UNION ALL
    SELECT 
    NULL T2_ID,
    NULL T3_ID,
    SUM(VAL) TOTAL
FROM TABLE_1

CORRECT RESULTS:

T2_ID   T3_ID   TOTAL
AA      NULL    6
BB      NULL    24
NULL    CCC     30
NULL    DDD     30
AA      CCC     6
BB      CCC     24
AA      DDD     6
BB      DDD     24
NULL    NULL    30

When I try using GROUPING SETS -

SELECT 
    GROUPING_ID (T2_ID, T3_ID) GRP_ID,
    T2_ID,
    T3_ID,
    SUM(VAL) TOTAL
FROM TABLE_1
    LEFT JOIN TABLE_2 ON TABLE_2.T1_ID = TABLE_1.T1_ID
    LEFT JOIN TABLE_3 ON TABLE_3.T1_ID = TABLE_1.T1_ID
GROUP BY 
    GROUPING SETS ((),
            T2_ID,
            T3_ID,
            (T2_ID,T3_ID))

My results are doubling where GRP_ID = 1 or 3 shown below

-----------------------------------------
GRP_ID  T2_ID   T3_ID   TOTAL
0       AA      CCC     6
0       BB      CCC     24
2       NULL    CCC     30
0       AA      DDD     6
0       BB      DDD     24
2       NULL    DDD     30
1       AA      NULL    12
1       BB      NULL    48
3       NULL    NULL    60

UNION ALL approach works - but I keep thinking there should be a way to use GROUPING SETS.

EDIT: I need to also add the reason why I'm looking for a grouping set solution - the real-world version of this problem is not limited to two joins used to drive the group conditions (this can come from many joins), and any of them can end up duplicating rows like the TABLE_3 in the example. With this, the real-world grouping would look more like this, for example:

   grouping sets ((),
                  (a, b)
                  (a, c)
                  (a, d)
                  (a, b, c)
                  (a, b, c, d)
                  a, 
                  b, 
                  c, 
                  d)
                  

Solution

  • CORRECT ANSWER - this works for N grouping levels and combinations -

    The "SUB" CTE is needed to assign TABLE_1 row value for each individual grouping level, so that the summarizations are correct for each group level, which is done in the main/final query.

    But there is a performance hit after all; with my actual data set, the UNION ALL was completing in 17 seconds - the GROUPING SETS option in 33 seconds.

    SUB AS (
        SELECT 
            GROUPING_ID (TABLE_2.T2_ID, TABLE_3.T3_ID) GROUP_ID,
            TABLE_1.T1_ID,
            TABLE_2.T2_ID, 
            TABLE_3.T3_ID,
            MAX(TABLE_1.VAL) VAL
        FROM TABLE_1
            LEFT JOIN TABLE_2 ON TABLE_2.T1_ID = TABLE_1.T1_ID
            LEFT JOIN TABLE_3 ON TABLE_3.T1_ID = TABLE_1.T1_ID
        GROUP BY
            GROUPING SETS((),
                    TABLE_2.T2_ID,
                    TABLE_3.T3_ID,  
                    (TABLE_2.T2_ID, TABLE_3.T3_ID)
                    ),
            TABLE_1.T1_ID
    )
    
    --FINAL
    SELECT 
        GROUP_ID,
        T2_ID, 
        T3_ID,
        SUM(VAL) TOTAL_VAL
    FROM SUB
    GROUP BY
        GROUP_ID,
        T2_ID, 
        T3_ID
    
    

    (OLD ANSWER) - gets the same result as example table, BUT when other joins multiply the rows, it is not correct.

    SELECT 
        GRP_ID,
        T2_ID,
        T3_ID,
        SUM(TOTAL)
    FROM (
        SELECT DISTINCT
            GROUPING_ID (T2_ID, T3_ID) GRP_ID,
            SEMI_FINAL.T2_ID,
            SEMI_FINAL.T3_ID,
            SUM(VAL) TOTAL
        FROM (
            SELECT TABLE_1.*, T2_ID, T3_ID,
                RANK() OVER (ORDER BY TABLE_2.T2_ID) RANK2,
                RANK() OVER (ORDER BY TABLE_3.T3_ID) RANK3
            FROM TABLE_1
                LEFT JOIN TABLE_2 ON TABLE_2.T1_ID = TABLE_1.T1_ID
                LEFT JOIN TABLE_3 ON TABLE_3.T1_ID = TABLE_1.T1_ID
        ) SEMI_FINAL
        GROUP BY 
            GROUPING SETS ((),
                        SEMI_FINAL.T3_ID,
                        SEMI_FINAL.T2_ID,
                        (SEMI_FINAL.T2_ID, SEMI_FINAL.T3_ID))
                        , RANK2
                        , RANK3
     ) FINAL_TABLE          
    GROUP BY 
        GRP_ID,
        T3_ID,
        T2_ID
    

    And this returns the correct results -

    GRP_ID  T2_ID   T3_ID   TOTAL
    0       AA      CCC     6
    0       BB      CCC     24
    0       AA      DDD     6
    0       BB      DDD     24
    1       AA      NULL    6
    1       BB      NULL    24
    2       NULL    CCC     30
    2       NULL    DDD     30
    3       NULL    NULL    30