Search code examples
postgresqlsql-deletepostgresql-9.1large-data

delete duplicate rows in large postgresql database table


I have a postgresql database with 100 GB size. One of the tables has about half a billion entries. For quick data entry, some of the data was repeated and left to be pruned later. One of the columns can be used to identify the rows as unique.

I found this stackoverflow question which suggested a solution for mysql:

ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)

Is there anything similar for postgresql?

I tried deleting with group by and row number, my computer runs out of memory after a few hours in both cases.

This is what I get when I try to estimate the number of rows in the table:

SELECT reltuples FROM pg_class WHERE relname = 'orders';
  reltuples  
-------------
 4.38543e+08
(1 row)

Solution

  • Two solutions immediately come to mind:

    1). Create a new table as select * from source table with a WHERE clause to determine the unique rows. Add the indexes to match the source table, then rename them both in a transaction. Whether or not this will work for you depends on several factors, including amount of free disk space, if the table is in constant use and interruptions to access are permissible, etc. Creating a new table has the benefit of tightly packing your data and indexes, and the table will be smaller than the original because the non-unique rows are omitted.

    2). Create a partial unique index over your columns and add a WHERE clause to filter out the non-uniques. For example:

    test=# create table t ( col1 int, col2 int, is_unique boolean);
    CREATE TABLE
    
    test=# insert into t values (1,2,true), (2,3,true),(2,3,false);
    INSERT 0 3
    
    test=# create unique index concurrently t_col1_col2_uidx on t (col1, col2) where is_unique is true;
    CREATE INDEX
    
    test=# \d t
            Table "public.t"
      Column   |  Type   | Modifiers 
    -----------+---------+-----------
     col1      | integer | 
     col2      | integer | 
     is_unique | boolean | 
    Indexes:
        "t_col1_col2_uidx" UNIQUE, btree (col1, col2) WHERE is_unique IS TRUE