Search code examples
sqlpostgresqlduplicatesgreatest-n-per-groupsql-delete

How to break ties when comparing columns in SQL


I am trying to delete duplicates in Postgres. I am using this as the base of my query:

DELETE FROM case_file as p
WHERE EXISTS (
    SELECT FROM case_file as p1
    WHERE p1.serial_no = p.serial_no
    AND p1.cfh_status_dt < p.cfh_status_dt
    );

It works well, except that when the dates cfh_status_dt are equal then neither of the records are removed.

For rows that have the same serial_no and the date is the same, I would like to keep the one that has a registration_no (if any do, this column also has NULLS).

Is there a way I can do this with all one query, possibly with a case statement or another simple comparison?


Solution

  • DELETE FROM case_file AS p
    WHERE  id NOT IN (
       SELECT DISTINCT ON (serial_no) id  -- id = PK
       FROM   case_file 
       ORDER  BY serial_no, cfh_status_dt DESC, registration_no
       );
    

    This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.

    NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.

    If you want the greatest registration_no instead, to still sort NULL values last, use:

       ...
       ORDER  BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST
    

    See:

    If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:

    NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:

    If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.

    Or create a temporary table of survivors, truncate the old and insert from the temp table. This way depending objects like views or FK constraints can stay in place. See:

    Surviving rows are simply:

    SELECT DISTINCT ON (serial_no) *
    FROM   case_file 
    ORDER  BY serial_no, cfh_status_dt DESC, registration_no;