Search code examples
postgresqlmd5postgresql-14drop-duplicates

PostgreSQL: Delete duplicate rows based on matching md5 hashes


I have a table which is populated by data scraped from the web. After adding the new data to the table I want to delete duplicates if no changes have been made to specific columns.

I have tried the following query along with various ways of referencing the md5 hash, and casting column1 as text:

DELETE FROM
    my_table a
        USING my_table b
WHERE
    a.pk < b.pk
    AND a.md5(column1) = b.md5(column1);

and receive the following error:

ERROR: column reference "column1" is ambiguous 
LINE 6: AND a.md5(column1) = b.md5(column1); 
                  ^ 
SQL state: 42702 Character: 99

Solution

  • Don't deduplicate based on md5(), just use =

    Let Postgres figure out if it benefits from hashing at all, and if so, what the optimal hashing function should be. It's easier for you to declare, easier for Postgres to optimise and execute. SQL is meant to be declarative - state what you want, not how exactly it's supposed to be done.

    That way you also avoid potentially incorrect results - md5 sticks around just for simplicity but it has long been considered unreliable. It's rare but not impossible for your values to be completely different but share the same md5, so you may get undeserved deletes.

    If Postgres sees the values are too long to be compared with a plain = effectively, it'll change the value equality into value hash equality behind the scenes, using a hashing function of its own choice:

    explain analyze verbose
    DELETE FROM my_table a
    USING my_table b
    WHERE a.pk < b.pk
    AND a.column1 = b.column1
    returning *;
    
    QUERY PLAN
    Delete on public.my_table a (cost=3758.00..10978.49 rows=7480 width=650) (actual time=26.669..68.048 rows=1858 loops=1)
    ...
      -> Hash Join (cost=3758.00..10978.49 rows=7480 width=650) (actual time=26.642..64.441 rows=1985 loops=1)
    ...
            Hash Cond: (a.column1 = b.column1)
    ...
            -> Hash (cost=1867.00..1867.00 rows=20000 width=644) (actual time=26.444..26.445 rows=20000 loops=1)
    ...
    Execution Time: 68.541 ms

    Notice how the version with md5 affects the exact same amount of rows, just 4 times slower: demo

    explain analyze verbose
    DELETE FROM my_table a
    USING my_table b
    WHERE a.pk < b.pk
    AND md5(a.column1) = md5(b.column1)
    returning *;
    
    Delete on public.my_table a (cost=18078.54..63328.54 rows=666667 width=650) (actual time=215.633..249.269 rows=1858 loops=1)
      Output: a.pk, a.column1, b.pk, b.column1
      -> Merge Join (cost=18078.54..63328.54 rows=666667 width=650) (actual time=215.603..246.022 rows=1985 loops=1)
    ...
    Execution Time: 252.893 ms

    As a bonus, if future versions of Postgres get improved in terms of the hash-based operations, you'll get that for free with =, without having to revisit and update any code.