sql postgresql duplicates sql-update sql-delete

Merge duplicate table rows that are used in a relation with another table

I have the following table structure:

table_a
id | customer_id | product_id
---+-------------+------
 1 | c1          | p1
 2 | c1          | p1
 3 | c2          | p1

table_b
id | table_a_id  | attribute
---+-------------+------
 99 | 1          | a1
 98 | 2          | a2
 97 | 3          | a3

As you can see table_a have duplicates values, and I want to merge them. Unfortunatly the table_a PK is also used on table_b.

The final resoult should be:

table_a
id | customer_id | product_id
---+-------------+------
 1 | c1          | p1
 3 | c2          | p1

table_b
id | table_a_id  | attribute
---+-------------+------
 99 | 1          | a1
 98 | 1          | a2
 97 | 3          | a3

I have to update the table_b relation with table_a, and than clear all the unsed keys on table_a.

Unfortunatly the only query I've thought of is really heavy and the DB timeout before can be completed. table_a have 200k+ records and table_b is at least twice that.

My idea was:

join table_a with table_b , to get: (table_b_id, table_a_customer_id, table_a_product_id)
get the grouped up version of table_a. (to get the right id of table_a I just used min("id")
inner join the two above and use the result to update table_b.

Solution

Here is one option using common table expressions:

with 
    ta as (
        select ta.*, min(id) over(partition by customer_id, product_id) min_id
        from table_a ta
    ),
    upd as (
        update table_b tb
        set table_a_id = ta.min_id
        from ta
        where tb.table_a_id = ta.id and ta.id <> ta.min_id
    )
delete from table_a ta1
using ta
where 
    ta1.customer_id = ta.customer_id
    and ta1.product_id = ta.product_id
    and ta1.id > ta.id

The first CTE associates the target id to each row of table_a. Then, we use that information to update table_b. Finally, we delete duplicate rows in table_a, retaining the earliest id only.