Search code examples
sqlpostgresqlduplicatessql-updatesql-delete

Complicated SQL for update and delete on a large table


I am having a complicated SQL problem in PostgreSQL.

Suppose I have a large table called 'selling_prices'. It contains around 19 million rows. I want to remove some duplicate rows and also update some data. Here is the table structure:

seq customer_co_cd item_sku seliing_tanka_rate updatedate
1 1414343 sku001 0.4 2021-01-18 14:34:48
2 1414343 sku001 0.4 2021-01-18 14:34:48
3 1414343 sku001 0.4 2021-01-16 01:34:48
4 1512333 sku002 0.2 2021-01-16 01:34:48
5 1512333 sku002 0.5 2021-01-16 01:34:48

and so on....

Condition 1: If the customer_co_cd and item_sku and selling_tanka_rate is same update the latest updatedate to '2021/11/12' and delete the other data.

After the SQL table should be like: delete seq(2,3) and update seq 1

seq customer_co_cd item_sku seliing_tanka_rate updatedate
1 1414343 sku001 0.4 2021-11-12 00:00:00

Condition 2: If the (customer_co_cd and item_sku) is same and selling_tanka_rate is different then get the data as group

customer_co_cd item_sku count
1512333 sku002 2

I tried some query using group by but it is slow...

SELECT customer_co_cd, item_sku, COUNT(*) 
FROM selling_prices 
GROUP BY customer_co_cd,item_sku 
HAVING COUNT(*) > 1

I don't know how to query the condition 1. Also what is the efficient way to get condition 2. Keep in mind that there are around 19 million data.

Should I create a script or is there a efficient query I can use.


Solution

  • This should answers your needs (result here)

    select * from t1 order by seq;
    
    -- Update first
    with t2 as (
      select 
        row_number() over (partition by customer_co_cd,item_sku,seliing_tanka_rate order by customer_co_cd,item_sku,seliing_tanka_rate,seq) as rn,
        lead(seliing_tanka_rate) over (partition by customer_co_cd,item_sku order by customer_co_cd,item_sku,seq) as lead,
        * 
      from t1)
    update t1
    set updatedate = '20211112'
    from t2
    where t2.seq = t1.seq and t2.rn = 1
    and t2.seliing_tanka_rate = t2.lead;
    
    -- delete to keep the wanted records
    with t2 as (select row_number() over (partition by customer_co_cd,item_sku,seliing_tanka_rate order by customer_co_cd,item_sku,seliing_tanka_rate,seq) as rn,* from t1)
    delete 
    from t1
    where seq in (select seq from t2 where rn > 1);
    
    select * from t1 order by seq;
    
    -- Condition 2
    with t2 as (
      select *,
      lead(customer_co_cd) over (partition by customer_co_cd,item_sku) as co_cd,
      lead(item_sku) over (partition by customer_co_cd,item_sku) as sku,
      lead(seliing_tanka_rate) over (partition by customer_co_cd,item_sku) as rate
      from t1
      )
    select customer_co_cd,item_sku,
    count(*) filter (where customer_co_cd = t2.co_cd and item_sku = t2.sku and seliing_tanka_rate <> t2.rate) + 1 as count
    from t2
    group by customer_co_cd,item_sku
    having count(*) filter (where customer_co_cd = t2.co_cd and item_sku = t2.sku and seliing_tanka_rate <> t2.rate) + 1 > 1