Search code examples
sqlmysqlaggregate-functionsgreatest-n-per-group

Return row holding the group-wise maximum of a certain column, account for duplicate values in column


I've been trying to report viewer stats for an event portal. I've noticed on many occasions people reconnect multiple times, so I've worked out a viewer_id to associate them together. Each time they start watching the event they enter a name and the number of people viewing (including themselves).

I would like to be to be able to select the event views grouped by event_id and viewer_id combination while choosing the row with the greatest viewer_count for that given combination.

Example Schema & Data

-- Server Version: MySQL 8.0.43
CREATE TABLE `event_viewers` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `event_id` bigint unsigned NOT NULL,
  `viewer_id` bigint unsigned NOT NULL,
  `name` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `viewer_count` int NOT NULL,
  PRIMARY KEY (`id`)
);
-- Event ID 1
insert into event_viewers (id, event_id, viewer_id, name, viewer_count)
values  (1, 1, 1, 'Bert Kuvalis0', 1),
        (6, 1, 2, 'Wanda Steuber0', 7),
        (11, 1, 3, 'Erick Nienow0', 4),
        (16, 1, 3, 'Erick Nienow1', 3),
        (17, 1, 3, 'Erick Nienow2', 4);
-- Event ID 2
insert into event_viewers (id, event_id, viewer_id, name, viewer_count)
values  (2, 2, 1, 'Bert Kuvalis2', 11),
        (7, 2, 2, 'Wanda Steuber2', 10),
        (12, 2, 3, 'Erick Nienow3', 7),
        (18, 2, 2, 'Wanda Steuber3', 13);

I would like to be able to get a result of:

id event_id viewer_id name viewer_count
1 1 1 Bert Kuvalis 1
6 1 2 Wanda Steuber0 7
11 1 3 Erick Nienow0 4
2 2 1 Bert Kuvalis2 11
18 2 2 Wanda Steuber3 13
12 2 3 Erick Nienow3 7

In the above results, there is a record with identical event_id, viewer_id & viewer_count that has the greatest viewer_count, I have resolved this by taking the first row that matches (id 11), an discarding the second (id 17). For my problem, I don't actually care which gets selected (11 or 17), as long as only one gets selected.

Things I've tried:

GROUP BY

My most successful try has been to use GROUP BY and MAX, but lacks the base row id and name.

SELECT
    ev.event_id,
    ev.viewer_id,
    MAX(ev.`viewer_count`) AS `viewer_count`
FROM event_viewers as ev
GROUP BY ev.viewer_id, ev.event_id ORDER BY `event_id`, `viewer_id`;

This returns the desired output with the exclusion of the id and name columns.

event_id viewer_id viewer_count
1 1 1
1 2 7
1 3 4
2 1 11
2 2 13
2 3 7

WHERE NOT EXISTS

I tried using WHERE NOT EXISTS to see if I could exclude the duplicates, preferring the largest.

SELECT DISTINCT ev1.* from event_viewers ev1
WHERE NOT EXISTS (
  SELECT * FROM event_viewers as ev2
  WHERE ev2.viewer_id = ev1.viewer_id
  AND ev2.event_id = ev1.event_id
  AND ev2.viewer_count > ev1.viewer_count
) ORDER BY `event_id`, `viewer_id`;

See below for result, as WHERE NOT EXISTS, LEFT JOIN and WITH WINDOW tries all have identical outputs. They are including an extra row that has a matching event_id, viewer_id and viewer_count it shows up twice. (both 11 and 17 showing, wanting only one of them).

LEFT JOIN

As recommended by the MySQL documentation on rows holding group-wise maximum of a certain column, I tried using a LEFT JOIN:

SELECT ev1.* FROM event_viewers ev1 
LEFT JOIN event_viewers ev2 
ON ( ev1.viewer_count<ev2.viewer_count AND ev1.viewer_id=ev2.viewer_id AND ev1.event_id=ev2.event_id )
WHERE ev2.id IS null
ORDER BY ev1.event_id, ev1.`viewer_id`;

See below for result, as WHERE NOT EXISTS, LEFT JOIN and WITH WINDOW tries all have identical outputs. They are including an extra row that has a matching event_id, viewer_id and viewer_count it shows up twice. (both 11 and 17 showing, wanting only one of them).

WITH WINDOW

As recommended by the MySQL documentation on rows holding group-wise maximum of a certain column, I attempted using a WITH:

WITH w1 AS (
    SELECT *,
           RANK() OVER (PARTITION BY viewer_id, event_id
               ORDER BY viewer_count DESC
               ) AS `Rank`
    FROM event_viewers
)
SELECT id, event_id, viewer_id, name, viewer_count
FROM w1
WHERE `Rank` = 1
ORDER BY `event_id`, `viewer_id`;

See below for result, as WHERE NOT EXISTS, LEFT JOIN and WITH WINDOW tries all have identical outputs. They are including an extra row that has a matching event_id, viewer_id and viewer_count it shows up twice. (both 11 and 17 showing, wanting only one of them).

id event_id viewer_id name viewer_count
1 1 1 Bert Kuvalis 1
6 1 2 Wanda Steuber0 7
11 1 3 Erick Nienow0 4
17 1 3 Erick Nienow2 4
2 2 1 Bert Kuvalis2 11
18 2 2 Wanda Steuber3 13
12 2 3 Erick Nienow3 7

Solution

  • You need some unique column to break ties, so you can eliminate the redundant row with viewer_count=4.

    Here's an example, where I added an ORDER BY ... id ASC to the window function then used ROW_NUMBER() instead of RANK():

    WITH w1 AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY viewer_id, event_id
                   ORDER BY viewer_count DESC, 
                            id ASC                   -- here
               ) AS ROWNUM
        FROM event_viewers
    )
    SELECT id, event_id, viewer_id, name, viewer_count
    FROM w1
    WHERE ROWNUM = 1
    ORDER BY `event_id`, `viewer_id`;