Search code examples
mysqlsqlinner-joinsql-deleteself-join

MySQL use DELETE FROM to remove duplicates rows


I'm learning MySQL and today I tried to solve an MySQL question on leetcode: https://leetcode.com/problems/delete-duplicate-emails/solution/

+----+------------------+
| Id | Email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
| 3  | [email protected] |
+----+------------------+
Id is the primary key column for this table.

The basic idea is to remove duplicate rows from the table and only keep the one with the smallest Id. Expect results:

+----+------------------+
| Id | Email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
+----+------------------+

The solution is

DELETE p1 FROM Person p1,Person p2 
WHERE
p1.Email = p2.Email AND p1.Id > p2.Id

And I tried a second test case myself:

+----+------------------+
| Id | Email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
| 3  | [email protected] |
| 4  | [email protected] |
+----+------------------+

The output is:

+----+------------------+
| Id | Email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
+----+------------------+

I don't understand why the condition p1.Id > p2.Id only keeps the row with the smallest id because I think it only rules out the smaller one, it wouldn't apply to the case where there were more than 2 duplicates, right? But as the second test case shows, it indeed works. I think it's because of the syntax of

DELETE Table1
FROM Table1
INNER JOIN Table2 ON Table1.ID = Table2.ID

How does this work exactly? Can someone gives me an explanation here, thanks!


Solution

  • First, this is a very bad way of implementing this code. But I guess you get what you pay for.

    Second, simply run the query as a select:

    SELECT p1.*, p2.*
    FROM Person p1 JOIN
         Person p2 
         ON p1.Email = p2.Email AND p1.Id > p2.Id;
    

    (Note that I've rewritten the logic as a JOIN. You should always use proper, explicit, standard, readable JOIN syntax, but the two methods are functionally equivalent.)

    On your second example, the results of this query are:

    table1 email     table1 id    table2 id
    [email protected].    2            1
    [email protected].    3            1
    [email protected].    3            2
    

    What is notable is that id = 1 is never in the second column -- and that is the column that determines which ids are deleted. In other words, all but the smallest id for each email get deleted because there is a smaller id.

    This also hints at why this is a really bad solution. MySQL has to deal with two rows for id = 3. Perhaps it attempts to delete both. Perhaps it has to just deal with extra data. Either way, there is extra work. And the more rows with the same email in the data the more extra duplicates are created.

    An alternative method, such as:

    delete p
        from person p join
             (select email, min(id) as min_id
              from person p2
              group by email
             ) p2
             on p.email = p2.email and p.id > p2.min_id;
    

    Does not have this problem and, in my opinion, the intent is clearer.