This code gives me a table of the unique values (without duplicates):
SELECT id, firstname, lastname, startdate, position
FROM (
SELECT id, firstname, lastname, startdate, position,
ROW_NUMBER() OVER (PARTITION BY (firstname, lastname) ORDER BY startdate DESC) rn
FROM people
) tmp
WHERE rn = 1;
What syntax would replace the current table with just the results of this one?
Alternatively, I could use WHERE rn <> 1
to get all the data I want to delete, but again, I am struggling to get the syntax of the DELETE
right using this method.
Assuming values in firstname
, lastname
and startdate
are never NULL
, this simple query with a NOT EXISTS
anti-semi-join does the job:
DELETE FROM people AS p
WHERE EXISTS (
SELECT FROM people AS p1
WHERE p1.firstname = p.firstname
AND p1.lastname = p.lastname
AND p1.startdate > p.startdate
);
It deletes every row where a newer copy exists, effectively keeping the latest row per group of peers. (Of course, (firstname, lastname)
is a poor way of establishing identity. There are many distinct people with identical names. The demo may be simplified ...)
Can there be identical values in startdate
? Then you need a tiebreaker ...
Typically faster than using a subquery with row_number()
. There are a hundred and one ways to make this faster, depending on your precise situation and requirements. See:
If compared columns can be NULL
, consider:
There is a whole dedicated tag for duplicate-removal. Combine it with postgres to narrow down: