Search code examples
sqlpostgresqlduplicatessql-deleterow-value-expression

Unable to delete duplicate rows with PostgreSQL


My query deletes the whole table instead of duplicate rows. Video as proof: https://streamable.com/3s843

create table customer_info (
    id INT,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    phone_number VARCHAR(50)
);
insert into customer_info (id, first_name, last_name, phone_number) values
(1, 'Kevin', 'Binley', '600-449-1059'),
(1, 'Kevin', 'Binley', '600-449-1059'),
(2, 'Skippy', 'Lam', '779-278-0889');

My query:

with t1 as (
select *, row_number() over(partition by id order by id) as rn
from customer_info)

delete
from customer_info 
where id in (select id from t1 where rn > 1);

Solution

  • Your query would delete all rows from each set of dupes (all share the same id by which you select - that's what wildplasser hinted at with subtle comments). Only initially unique rows would survive. So if it "deletes the whole table", that means there were no unique rows at all.

    In your query, dupes are defined by (id) alone, not by the whole row as your title suggests.

    Either way, there is a remarkably simple solution:

    DELETE FROM customer_info c
    WHERE  EXISTS (
       SELECT FROM customer_info c1
       WHERE  ctid < c.ctid
       AND    c1 = c  -- comparing whole rows
       );
    

    Since you deal with completely identical rows, the remaining way to tell them apart is the internal tuple ID ctid.

    Careful with table inheritance or partitioning. Then there can be multiple physical tables involved and ctid is not unique within the scope. (But there shouldn't be completely identical rows in partition or inheritance children.) See:

    My query deletes all rows, where an identical row with a smaller ctid exists. Hence, only the "first" row from each set of dupes survives.

    Notably, NULL values compare equal in this case - which is most probably as desired. The manual:

    The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.23.5) or comparing a row constructor to the output of a subquery (as in Section 9.22). In other contexts where two composite-type values are compared, two NULL field values are considered equal, [...]

    If dupes are defined by id alone (as your query suggests), then this would work:

    DELETE FROM customer_info c
    WHERE  EXISTS (
       SELECT FROM customer_info c1
       WHERE  ctid < c.ctid
       AND    id = c.id
       );
    

    But then there might be a better way to decide which rows to keep than ctid as a measure of last resort!

    Obviously, you would then add a PRIMARY KEY to avoid the initial dilemma from reappearing. For the second interpretation, id is the candidate.

    Related:

    About ctid: