Search code examples
sqlpostgresqlduplicatessql-updatesql-delete

Merge duplicate table rows that are used in a relation with another table


I have the following table structure:

table_a
id | customer_id | product_id
---+-------------+------
 1 | c1          | p1
 2 | c1          | p1
 3 | c2          | p1

table_b
id | table_a_id  | attribute
---+-------------+------
 99 | 1          | a1
 98 | 2          | a2
 97 | 3          | a3

As you can see table_a have duplicates values, and I want to merge them. Unfortunatly the table_a PK is also used on table_b.

The final resoult should be:

table_a
id | customer_id | product_id
---+-------------+------
 1 | c1          | p1
 3 | c2          | p1

table_b
id | table_a_id  | attribute
---+-------------+------
 99 | 1          | a1
 98 | 1          | a2
 97 | 3          | a3

I have to update the table_b relation with table_a, and than clear all the unsed keys on table_a.

Unfortunatly the only query I've thought of is really heavy and the DB timeout before can be completed. table_a have 200k+ records and table_b is at least twice that.

My idea was:

  • join table_a with table_b , to get: (table_b_id, table_a_customer_id, table_a_product_id)
  • get the grouped up version of table_a. (to get the right id of table_a I just used min("id")
  • inner join the two above and use the result to update table_b.

Solution

  • Here is one option using common table expressions:

    with 
        ta as (
            select ta.*, min(id) over(partition by customer_id, product_id) min_id
            from table_a ta
        ),
        upd as (
            update table_b tb
            set table_a_id = ta.min_id
            from ta
            where tb.table_a_id = ta.id and ta.id <> ta.min_id
        )
    delete from table_a ta1
    using ta
    where 
        ta1.customer_id = ta.customer_id
        and ta1.product_id = ta.product_id
        and ta1.id > ta.id
    

    The first CTE associates the target id to each row of table_a. Then, we use that information to update table_b. Finally, we delete duplicate rows in table_a, retaining the earliest id only.