I have a table like this:
CREATE TABLE foo (
id SERIAL,
a INTEGER,
b INTEGER,
some DOUBLE PRECISION,
other VARCHAR,
data INTEGER
);
And I know that (a, b)
must be unique, but for some reason this is not enforced by the database. Suppose I don't care about further discrimination: I just wipe duplicates from time to time, keeping only the most recent one (the one with the biggest id):
DELETE FROM foo WHERE id NOT IN (SELECT max(id) FROM foo GROUP BY a, b);
Now suppose there are several hundred thousand records; the IN (...)
list becomes pretty large.
I read about self-joins (like: DELETE a FROM foo a JOIN foo b ON a.a = b.a AND a.b = b.b WHERE a.id < b.id
), but this also means that I get huge intermediate result sets if I have many duplicates.
What alternatives do I have?
If you want to delete older duplicate values, you can use:
delete from foo
where foo.id < (select max(foo2.id)
from foo foo2
where foo2.a = foo.a and foo2.b = foo.b
);
Note that an index on (a, b, id)
would help performance.
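Here is a minimal sketch of the correlated-subquery delete in action. It uses SQLite through Python's sqlite3 module rather than Postgres, since the syntax involved is the same for this statement; the table layout and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, data INTEGER)")
# Two duplicate pairs, (1, 1) and (2, 2), plus one unique row.
cur.executemany("INSERT INTO foo VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 10), (2, 1, 1, 11),
                 (3, 2, 2, 20), (4, 2, 2, 21),
                 (5, 3, 3, 30)])
# The supporting index suggested above.
cur.execute("CREATE INDEX foo_a_b_id ON foo (a, b, id)")
# Correlated-subquery delete: a row is removed if a newer duplicate exists.
cur.execute("""
    DELETE FROM foo
    WHERE foo.id < (SELECT max(foo2.id)
                    FROM foo foo2
                    WHERE foo2.a = foo.a AND foo2.b = foo.b)
""")
conn.commit()
print([r[0] for r in cur.execute("SELECT id FROM foo ORDER BY id")])  # [2, 4, 5]
```

Only the highest id per (a, b) group survives; the unique row is untouched.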
You can also phrase this as a join:
delete from foo
using (select a, b, max(id) as max_id
from foo
group by a, b
) ab
where foo.a = ab.a and foo.b = ab.b and foo.id < ab.max_id;
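SQLite does not support Postgres's DELETE ... USING, so the following sketch expresses the same join logic with an EXISTS against the grouped maxima, again on made-up sample data, to show it keeps the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, data INTEGER)")
cur.executemany("INSERT INTO foo VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 10), (2, 1, 1, 11),
                 (3, 2, 2, 20), (4, 2, 2, 21),
                 (5, 3, 3, 30)])
# Same logic as the DELETE ... USING form: join each row against the
# per-(a, b) max id and delete the ones that fall below it.
cur.execute("""
    DELETE FROM foo
    WHERE EXISTS (SELECT 1
                  FROM (SELECT a, b, max(id) AS max_id
                        FROM foo
                        GROUP BY a, b) ab
                  WHERE foo.a = ab.a AND foo.b = ab.b AND foo.id < ab.max_id)
""")
conn.commit()
print([r[0] for r in cur.execute("SELECT id FROM foo ORDER BY id")])  # [2, 4, 5]
```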