Tags: sql, postgresql, postgresql-8.4

Find duplicate records in large table on multiple columns the right way


I have read many threads on this subject and tried a few things, but nothing has worked as I hoped. I need some clarification, and I apologize if this is considered a duplicate thread.

A client of mine hosts a Postgres database where one table holds a little more than 12 million records. They have tasked me with finding duplicate records, extracting them for review and, if everything looks OK, deleting the duplicates.

My main concern is performance on the server. Running DISTINCT queries over 12 million records must consume a lot of resources, right?

Since my first task is to extract the records for viewing in, say, a CSV file, rather than simply deleting them, my approach in pgAdmin was to execute this query to a file.

SELECT *
FROM my_table
WHERE my_table_id NOT IN (
    SELECT DISTINCT ON (
            num_1,
            num_2,
            num_3,
            num_4,
            num_5,
            my_date
        )
        my_table_id
    FROM my_table
);

However, this query takes way too long. After 20 minutes of execution time I halted it. To make things more complicated, my client is reluctant to let me clone a local copy of the table because of strict security requirements. They prefer that everything be done in the live hosting environment.

The table definition is quite simple. It looks like this:

CREATE TABLE my_table
(
    my_table_id bigserial NOT NULL,
    num_1 bigserial NOT NULL,
    num_2 bigserial NOT NULL,
    num_3 bigserial NOT NULL,
    num_4 numeric,
    num_5 integer,
    my_date date,
    my_text character varying
);

The primary key "my_table_id" has not been compromised and is always unique. The column "my_text" is not relevant to the query, since it is empty for all duplicates. Only the numeric fields and the date need matching: all columns except my_table_id and my_text must be identical across records to qualify as a duplicate.
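As an aside: if the server can spare the disk space and a maintenance window, a multicolumn index on the six matching columns can help any duplicate-hunting query, since Postgres can then group or join on the index rather than repeatedly sorting the full table. A sketch (the index name is just an illustration, and building it on 12 million rows itself takes time, so it is worth checking with the client first):

    -- Hypothetical supporting index on the columns that define a duplicate.
    -- Building it costs one pass over the table plus disk space, but it can
    -- then serve the GROUP BY / join in the de-duplication queries.
    CREATE INDEX my_table_dup_check_idx
        ON my_table (num_1, num_2, num_3, num_4, num_5, my_date);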

What is the best way to solve this? Is there a server-friendly approach that won't eat all the resources on the host environment? Please help me understand the best approach!

Thank you!


Solution

  • Use GROUP BY with HAVING to find the duplicate records instead of DISTINCT.

    The subquery finds every combination of values that occurs more than once; the outer join then pulls back all rows carrying those values:

    SELECT mt.*
    FROM my_table mt
    JOIN (
        SELECT
            num_1,
            num_2,
            num_3,
            num_4,
            num_5,
            my_date
        FROM my_table
        GROUP BY num_1, num_2, num_3, num_4, num_5, my_date
        HAVING COUNT(*) > 1
    ) t
      ON  mt.num_1 = t.num_1
      AND mt.num_2 = t.num_2
      AND mt.num_3 = t.num_3
      -- num_4, num_5 and my_date are nullable: plain "=" never matches NULLs,
      -- while GROUP BY treats NULLs as equal. IS NOT DISTINCT FROM keeps the
      -- join consistent with the grouping.
      AND mt.num_4   IS NOT DISTINCT FROM t.num_4
      AND mt.num_5   IS NOT DISTINCT FROM t.num_5
      AND mt.my_date IS NOT DISTINCT FROM t.my_date;
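
    Building on the query above, the export-then-delete workflow the question describes could be sketched roughly as follows. Assumptions worth flagging: the file path is only an example, COPY writes a file on the database server (from a client machine, use psql's \copy instead), and the DELETE keeps the row with the lowest my_table_id in each duplicate group, which requires the window functions introduced in PostgreSQL 8.4:

        -- 1) Export all rows belonging to a duplicate group for review.
        --    '/tmp/duplicates.csv' is an example path on the server.
        COPY (
            SELECT mt.*
            FROM my_table mt
            JOIN (
                SELECT num_1, num_2, num_3, num_4, num_5, my_date
                FROM my_table
                GROUP BY num_1, num_2, num_3, num_4, num_5, my_date
                HAVING COUNT(*) > 1
            ) t
              ON  mt.num_1 = t.num_1
              AND mt.num_2 = t.num_2
              AND mt.num_3 = t.num_3
              AND mt.num_4   IS NOT DISTINCT FROM t.num_4
              AND mt.num_5   IS NOT DISTINCT FROM t.num_5
              AND mt.my_date IS NOT DISTINCT FROM t.my_date
        ) TO '/tmp/duplicates.csv' WITH CSV HEADER;

        -- 2) After review: delete every duplicate except the row with the
        --    lowest my_table_id in each group (NULLs group together here,
        --    matching the GROUP BY behaviour above).
        DELETE FROM my_table
        WHERE my_table_id IN (
            SELECT my_table_id
            FROM (
                SELECT my_table_id,
                       ROW_NUMBER() OVER (
                           PARTITION BY num_1, num_2, num_3, num_4, num_5, my_date
                           ORDER BY my_table_id
                       ) AS rn
                FROM my_table
            ) ranked
            WHERE rn > 1
        );

    Running the DELETE inside an explicit transaction (BEGIN; ... then COMMIT; or ROLLBACK;) lets you inspect the reported row count before making the change permanent on the live server.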