Search code examples
sqlmysql

SELECT duplicate values excluding most frequently used


I have two relevant tables in MySQL that look something like this:

marker table

id SEASONCD ITEMCD PRICETYPECD
1 foo bar baz
2 foo bar baz
3 foo bar baz
4 qux bar baz
5 qux bar baz
6 spam eggs ham

seat_marker table

id MARKER_ID
1 1
2 1
3 1
4 6

I would like to select all of the rows in the marker table that are duplicates, excluding the one that is most frequently referenced / joined by the seat_marker table. In other words, I would like to list all duplicates excluding the "original", where "original" is defined as the row with the most occurrences in seat_marker.MARKER_ID.

The result would look like this:

id SEASONCD ITEMCD PRICETYPECD
2 foo bar baz
3 foo bar baz
5 qux bar baz

We can see this is the original data from the marker table, but with three rows removed;

  • id 1 was removed because it was the foo-bar-baz duplicate that occurred more frequently in seat_marker than the other foo-bar-baz rows.
  • id 4 was removed because it was a duplicate. Since the qux-bar-baz rows show up 0 times in seat_marker, I don't have a preference if we remove id 4 or id 5. I chose id 4 at random.
  • id 6 was removed because it is not a duplicate

The closest I've been able to get is this, which lists all values that are a duplicate and the number of assignments it has. But does not exclude the "originals".

-- Select all duplicates, and list the number of times it's referenced in the seat_marker table
SELECT 
  n1.id, n1.SEASONCD, n1.ITEMCD, n1.PRICETYPECD, (
    SELECT COUNT(1) 
    FROM seat_marker 
    WHERE MARKER_ID = n1.id
  ) active_assignments
FROM marker n1, marker n2 
WHERE
    n1.id <> n2.id AND 
    n1.SEASONCD = n2.SEASONCD AND 
    n1.ITEMCD = n2.ITEMCD AND
    n1.PRICETYPECD = n2.PRICETYPECD
GROUP BY id, SEASONCD, ITEMCD, PRICETYPECD, active_assignments
ORDER BY SEASONCD, ITEMCD, PRICETYPECD, active_assignments DESC, id;

Solution

  • The expected result included in your question appears to be incorrect. It should be displaying 4 instead of 5, as 5 has one corresponding row in seat_marker and is therefor the "row with the most occurrences".

    One way to do this is with COUNT and ROW_NUMBER():

    WITH cte AS (
        SELECT m.*,
            ROW_NUMBER() OVER (PARTITION BY SEASONCD, ITEMCD, PRICETYPECD ORDER BY COUNT(sm.id) DESC, m.id) AS rn
        FROM marker m
        LEFT JOIN seat_marker sm ON m.id = sm.MARKER_ID
        GROUP BY m.id
    )
    SELECT * FROM cte WHERE rn > 1;
    

    Output:

    id SEASONCD ITEMCD PRICETYPECD rn
    2 foo bar baz 2
    3 foo bar baz 3
    4 qux bar baz 2

    Here's a db<>fiddle.