Search code examples
sqlsql-serverjoinaggregate-functionssql-server-2017

SQL Server join on result of aggregate function


SQL Server 2017. I have 2 tables, with hacker name and id, and another coding challenges submitted by each (below). I need to output id, name and count of challenges, filtering out those hackers who submitted the same number of challenges, except when this number is a maximum.
Here is sample data and final output i need

Hackers:

hacker_id name
1 john
2 tom
3 anna
4 mary
5 steve

Challenges:

challenge_id    hacker_id
1   1
2   1
3   1
4   2
5   2
6   2
7   2
8   3
9   3
10  3
11  4
12  4
13  4
14  4
15  5
16  5

this is the count of challenges per person (from here, we see that max number is 4 per person) :

hacker_id   name    count of challenges
1   john    3
2   tom     4
3   anna    3
4   mary    4
5   steve   2

The final output would be as follows:

hacker_id   name    count of challenges
2   tom     4
4   mary    4
5   steve   2

i.e. tom and mary both submitted 4 challenges. They are included because although the number 4 repeats, it's a maximum John and Anna submitted 3 each. They are excluded because 3 is not a max per person. Steve submitted 2 and this number is unique, so he's included, too.

Here is my code:

SELECT h.hacker_id, 
h.name, 
COUNT(c.challenge_id) AS ChalCountPerHead
FROM    hackers h 
        JOIN challenges c ON h.hacker_id = c.hacker_id
        LEFT JOIN ( 
            SELECT d.FreqHacker, COUNT(d.FreqHacker) as FreqOfFreq FROM 
                (SELECT hacker_id, COUNT(challenge_id) AS FreqHacker 
                 FROM Challenges GROUP BY hacker_id) d
            GROUP BY d.FreqHacker
        ) dd
        ON FreqHacker = COUNT(c.challenge_id)
GROUP BY h.hacker_id, h.name
HAVING 
COUNT(c.challenge_id) = (SELECT MAX(d.FreqHacker) from d) 
OR
dd.FreqOfFreq = 1

It doesn't work, stating an error message on this line

ON FreqHacker = COUNT(c.challenge_id)

An aggregate cannot appear in an ON clause unless it is in a sub query contained in a HAVING clause or select list, and the column being aggregated is an outer reference.


Solution

  • Here is one way to do it.

    Having sample data in the question make it way easier to verify the solution. Please include it next time.

    CTE is a simple aggregation to get the number of challenges submitted by each hacker.

    In CTE2 MAX gives the global maximum frequency. HackerCountOfSameFreq is the count of hackers that have the same frequency.

    Final WHERE removes groups of hackers that consist of more than 1 hacker, but leaves the group that has the maximum frequency.

    Sample data

    DECLARE @Hackers TABLE (hacker_id int, name varchar(50));
    INSERT INTO @Hackers VALUES
    (1, 'john'),
    (2, 'tom'),
    (3, 'anna'),
    (4, 'mary'),
    (5, 'steve');
    
    DECLARE @Challenges TABLE (challenge_id int, hacker_id int);
    INSERT INTO @Challenges VALUES
    (1 ,  1),
    (2 ,  1),
    (3 ,  1),
    (4 ,  2),
    (5 ,  2),
    (6 ,  2),
    (7 ,  2),
    (8 ,  3),
    (9 ,  3),
    (10,  3),
    (11,  4),
    (12,  4),
    (13,  4),
    (14,  4),
    (15,  5),
    (16,  5);
    

    Query

    WITH
    CTE
    AS
    (
        SELECT hacker_id, COUNT(*) AS FreqHacker
        FROM @Challenges
        GROUP BY hacker_id
    )
    ,CTE2
    AS
    (
        SELECT
            hacker_id
            ,FreqHacker
            ,COUNT(*) OVER (PARTITION BY FreqHacker) AS HackerCountOfSameFreq
            ,MAX(FreqHacker) OVER () AS GlobalMaxFreq
        FROM CTE
    )
    
    SELECT
        CTE2.hacker_id
        ,CTE2.FreqHacker
        ,H.Name
    FROM
        CTE2
        INNER JOIN @Hackers AS H ON H.hacker_id = CTE2.hacker_id
    WHERE
        HackerCountOfSameFreq = 1
        OR FreqHacker = GlobalMaxFreq
    ORDER BY
        CTE2.hacker_id
    ;
    

    Result

    +-----------+------------+-------+
    | hacker_id | FreqHacker | Name  |
    +-----------+------------+-------+
    | 2         | 4          | tom   |
    +-----------+------------+-------+
    | 4         | 4          | mary  |
    +-----------+------------+-------+
    | 5         | 2          | steve |
    +-----------+------------+-------+
    

    Your query also produces correct result (at least with this sample data), once the syntax is fixed.

    I've split it into CTEs, leaving most of your logic as is:

    WITH
    d
    AS
    (
        SELECT hacker_id, COUNT(challenge_id) AS FreqHacker 
        FROM @Challenges 
        GROUP BY hacker_id
    )
    ,dd
    AS
    (
        SELECT d.FreqHacker, COUNT(d.FreqHacker) as FreqOfFreq 
        FROM d
        GROUP BY d.FreqHacker
    )
    ,d3
    AS
    (
        SELECT 
            h.hacker_id, 
            h.name, 
            COUNT(c.challenge_id) AS ChalCountPerHead
        FROM    
            @hackers h 
            JOIN @challenges c ON h.hacker_id = c.hacker_id
        GROUP BY h.hacker_id, h.name
    )
    ,d4
    AS
    (
        SELECT *
        FROM 
            d3
            LEFT JOIN dd ON dd.FreqHacker = ChalCountPerHead
    )
    SELECT *
    FROM d4
    WHERE
        ChalCountPerHead = (SELECT MAX(d.FreqHacker) from d)
        OR FreqOfFreq = 1
    ORDER BY hacker_id
    ;