Search code examples
sqlpostgresqlpostgresql-9.1postgresql-9.3rails-postgresql

How to Delete all duplicate rows in postgres sql based on 3 duplicate columns and out of that keep max and second max date value rows


I have a table temp it has 118507658 records . I need to clean up this table based on below condition.

If three columns that is (dp_content_definition_id,dp_order ,value_scope_id) are identical then delete these record but keep max and second max entry_date record.

I wrote this query to find the records .

select value_id ,content_definition_id ,order ,value_scope_id ,entry_date 
from temp ou
where (select count(*) from dp_values inr
where 
inr.content_definition_id = ou.content_definition_id 
and inr.order = ou.order 
and inr.value_scope_id = ou.value_scope_id ) > 3
order by content_definition_id,order ,value_scope_id
 limit 10000;

But my first problem is that it takes so much time just to find 10000 records and no idea how much time it will take to find to take all records. after that i dont know query to how to delete .


Solution

  • Simply use a Windowed Aggregate Function, create a ROW_NUMBER based on descending dates and look for those with a value >=3.

    select 
        select value_id ,content_definition_id ,order ,
        value_scope_id ,entry_date,
    from
     (
        select value_id ,content_definition_id ,order ,
           value_scope_id ,entry_date,
           row_number(*)
           over (partition by content_definition_id ,order ,value_scope_id 
                 order by entry_date desc ) as rn
        from temp ou
     ) dt
    where rn >= 3
    order by content_definition_id,order ,value_scope_id;
    

    This will return the data to be deleted. If you change the condition to where rn < 3 you get the data you want to keep. Depending on the percentage it might be more efficient to Insert/Select the remaining rows into a new table instead of deleting the old ones.