Search code examples
sqlpostgresqlsql-delete

delete duplicate records with "in"


I have a table like this:

CREATE TABLE foo (
  id SERIAL,
  a INTEGER,
  b INTEGER,
  some DOUBLE,
  other VARCHAR,
  data INTEGER
);

And I know that (a, b) must be unique, but for some reasons, this is not enforced by the database. Suppose I don't care about further discrimination, I just wipe duplicates from time to time, and maybe only keep the most recent one (with the biggest ID):

DELETE FROM foo WHERE id NOT IN (SELECT max(id) FROM foo GROUP BY a, b);

Now suppose that there are several 100.000 records, the IN (...) becomes pretty large.

I read about self-joins (like: DELETE FROM foo a LEFT JOIN foo b ON a.a = b.a AND a.b = b.b WHERE b.id < a.id), but this also means that I get huge intermediate tables if I have many duplicates.

What alternatives do I have?


Solution

  • If you want to delete older duplicate values, you can use:

    delete from foo
        where foo.id < (select max(foo2.id)
                        from foo foo2
                        where foo2.a = foo.a and foo2.b = foo.b
                       );
    

    Note that an index on (a, b, id) would help performance.

    You can also phrase this as a join:

    delete from foo
        using (select a, b, max(id) as max_id
               from foo
               group by a, b
              ) ab
        where foo.a = a.a and foo.b = ab.b and foo.id < ab.max_id;