I've inherited some MySQL table that was designed without correct constraint so it gets filled with some duplicate rows which I need to remove. The problem that across duplicate rows data isn't usually consistent, see example below:
id | request_id | guid_id | details | flag |
---|---|---|---|---|
1 | 10 | fh82EN | help me | 1 |
2 | 11 | fh82EN | ||
3 | 12 | fh82EN | assistance required | 1 |
4 | 12 | fh82EN | assistance required | 1 |
5 | 13 | fh82EN | ||
6 | 13 | fh82EN | assist me. | 1 |
7 | 13 | fh82EN | ||
8 | 14 | fh82EN | ||
Records with id: 1,2,8 perfectly fine. For duplicate records with id 3, 4 I have designed the query below which works fine and removes all duplicates without an issue:
DELETE IR.*
FROM platform.temp IR
WHERE id IN (
SELECT maxId AS id FROM (
SELECT MAX(id) as maxId, request_id, guid_id
FROM platform.temp
GROUP BY request_id, guid_id
HAVING COUNT(*) > 1
) AS T
);
The problem is records with id 5,6,7. You can see that the same record by (guid_id and request_id) is not consistent. So, my previous query will delete records with content too because of MAX(id)
. I have designed a query that fixes these records, but we are talking about a huge database and this query is painfully slow:
UPDATE platform.temp AS DEST_T
INNER JOIN (
SELECT request_id, guid_id, details, flag FROM platform.temp WHERE details IS NOT NULL AND details != ''
) AS SOURCE_T
SET DEST_T.details = SOURCE_T.details, DEST_T.flag = SOURCE_T.flag
WHERE DEST_T.guid_id = SOURCE_T.guid_id AND DEST_T.request_id = SOURCE_T.request_id;
How can I change my delete query that it will order my subgroup by field details
and will select not MAX(id)
but first id, so I will be sure that last row in subgroup will always be populated with value and will left?
MySQL version: 5.6.40-log
UPDATE1: The desired outcome after cleaning the table should be as follow:
id | request_id | guid_id | details | flag |
---|---|---|---|---|
1 | 10 | fh82EN | help me | 1 |
2 | 11 | fh82EN | ||
3 | 12 | fh82EN | assistance required | 1 |
6 | 13 | fh82EN | assist me. | 1 |
8 | 14 | fh82EN | ||
Use a self join of the table:
DELETE t1
FROM tablename t1 INNER JOIN tablename t2
ON t2.request_id = t1.request_id AND t2.guid_id = t1.guid_id
WHERE (t2.id < t1.id AND COALESCE(t1.details, '') = '')
OR
(t2.id > t1.id AND COALESCE(t2.details, '') <> '');
This will keep 1 row for each request_id
and guid_id
combination, not necessarily the one with the min id
.
See the demo.
Another way to do it, with conditional aggregation:
DELETE t1
FROM tablename t1 INNER JOIN (
SELECT request_id, guid_id,
MIN(id) min_id,
MIN(CASE WHEN COALESCE(details, '') <> '' THEN id END) min_id_not_null
FROM tablename
GROUP BY request_id, guid_id
) t2 ON t2.request_id = t1.request_id AND t2.guid_id = t1.guid_id
WHERE t1.id <> COALESCE(t2.min_id_not_null, t2.min_id);
This will keep the row with the min id
under your conditions, but maybe its performance would not be that good compared to the 1st query.
See the demo.