Search code examples
mysqlsqlduplicatesquery-optimization

Need some help to clean duplicates in MySQL table which didn't have constraint


I've inherited some MySQL table that was designed without correct constraint so it gets filled with some duplicate rows which I need to remove. The problem that across duplicate rows data isn't usually consistent, see example below:

id request_id guid_id details flag
1 10 fh82EN help me 1
2 11 fh82EN
3 12 fh82EN assistance required 1
4 12 fh82EN assistance required 1
5 13 fh82EN
6 13 fh82EN assist me. 1
7 13 fh82EN
8 14 fh82EN

Records with id: 1,2,8 perfectly fine. For duplicate records with id 3, 4 I have designed the query below which works fine and removes all duplicates without an issue:

DELETE IR.*
FROM platform.temp IR
WHERE id IN (
    SELECT maxId AS id FROM (
        SELECT MAX(id) as maxId, request_id, guid_id
        FROM platform.temp
        GROUP BY request_id, guid_id
        HAVING COUNT(*) > 1
    ) AS T
);

The problem is records with id 5,6,7. You can see that the same record by (guid_id and request_id) is not consistent. So, my previous query will delete records with content too because of MAX(id). I have designed a query that fixes these records, but we are talking about a huge database and this query is painfully slow:

UPDATE platform.temp AS DEST_T
INNER JOIN (
    SELECT request_id, guid_id, details, flag FROM platform.temp WHERE details IS NOT NULL AND details != ''
) AS SOURCE_T
SET DEST_T.details = SOURCE_T.details, DEST_T.flag = SOURCE_T.flag
    WHERE DEST_T.guid_id = SOURCE_T.guid_id AND DEST_T.request_id = SOURCE_T.request_id;

How can I change my delete query that it will order my subgroup by field details and will select not MAX(id) but first id, so I will be sure that last row in subgroup will always be populated with value and will left?

MySQL version: 5.6.40-log

UPDATE1: The desired outcome after cleaning the table should be as follow:

id request_id guid_id details flag
1 10 fh82EN help me 1
2 11 fh82EN
3 12 fh82EN assistance required 1
6 13 fh82EN assist me. 1
8 14 fh82EN

Solution

  • Use a self join of the table:

    DELETE t1
    FROM tablename t1 INNER JOIN tablename t2
    ON t2.request_id = t1.request_id AND t2.guid_id = t1.guid_id
    WHERE (t2.id < t1.id AND COALESCE(t1.details, '') = '')
          OR
          (t2.id > t1.id AND COALESCE(t2.details, '') <> '');
    

    This will keep 1 row for each request_id and guid_id combination, not necessarily the one with the min id.

    See the demo.

    Another way to do it, with conditional aggregation:

    DELETE t1
    FROM tablename t1 INNER JOIN (
      SELECT request_id, guid_id,
             MIN(id) min_id,
             MIN(CASE WHEN COALESCE(details, '') <> '' THEN id END) min_id_not_null
      FROM tablename
      GROUP BY request_id, guid_id
    ) t2 ON t2.request_id = t1.request_id AND t2.guid_id = t1.guid_id
    WHERE t1.id <> COALESCE(t2.min_id_not_null, t2.min_id);
    

    This will keep the row with the min id under your conditions, but maybe its performance would not be that good compared to the 1st query.

    See the demo.