Search code examples
oracleplsqltimestampclob

modify CLOB to remove duplicates


I have a table, which contains a CLOB. Based on the CLOB I called a function to create a hash key.

Since some of the CLOBs were duplicates, which in turn generated duplicate hash keys, which prevents me from creating a PRIMARY KEY on the column hash_val.

Below are grouped duplicates.

What I want to do is modify every CLOB in the list depending on the group by appending a timestamp at the bottom of each CLOB so there will no longer be duplicates and then I can add the PRIMARY KEY.

I was hoping someone can help me out by generating a loop and appending the timestamp and I'll put an INSERT/update trigger on the table to generate future hash values.

CREATE table table_z(
seq_num integer  GENERATED BY DEFAULT AS IDENTITY (START WITH 1) NOT NULL,
val NUMBER, 
hash_val VARCHAR2(1000), 
clob_val CLOB);


INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    1,    '4714870AFF6C97CA09D135834FDB58A6389A50C11FEF8EC4AFEF466FB60A23AC6B7A9C92658F14DF4993D6B40A4E4D8424196AFC347E97640D68DE61E1CF14B0',    'aaaaaaaaaa');

INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    1,    'F368A29B71BD201A7EF78B5DF88B1361FBE83F959756D33793837A5D7B2EAF660F2F6C7E2FBACE01965683C4CFAFDED3FF28AAB34E329AA79BC81E7703F68B86',    'aaaaa');

INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    2,    '517C1CDB694A83ABF80A1D91EE91059B6443769DBEDDF3F5CC583CCCCC1CCDFE9E5330C61830D9E25AF03536909E8272F056C8FF1FBC9AABD3492C291A735B58',    'Xaaaaaaaaa');

INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    2,    'D597AD764E82E38DED6184527197C5CA39743F805F1D2355A89E62ECA275D62CD545DDFA57A36B37C711527A63717A69586CBE78AD056A92A0C6479391FC2349',    'xxxx');

INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    3,    '9B71D224BD62F3785D96D46AD3EA3D73319BFBC2890CAADAE2DFF72519673CA72323C3D99BA5C11D7C7ACC6E14B8C5DA0C4663475C2E5C3ADEF46F73BCDEC043',    'hello');

INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    1,    '4714870AFF6C97CA09D135834FDB58A6389A50C11FEF8EC4AFEF466FB60A23AC6B7A9C92658F14DF4993D6B40A4E4D8424196AFC347E97640D68DE61E1CF14B0',    'aaaaaaaaaa');

INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    2,    '4714870AFF6C97CA09D135834FDB58A6389A50C11FEF8EC4AFEF466FB60A23AC6B7A9C92658F14DF4993D6B40A4E4D8424196AFC347E97640D68DE61E1CF14B0',    'aaaaaaaaaa');

INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    2,    '6522DA2F3FE4F163D52ACEF62440C086BE5EC1203C2CE90A5427546A1CAFE6440618FD3AF2C8A3362AB7BC7544600CA77BED41F95D8038A8A7CC458177691474',    'oracle');

INSERT into table_z (    VAL,    HASH_VAL,    CLOB_VAL) VALUES (
    3,    '6522DA2F3FE4F163D52ACEF62440C086BE5EC1203C2CE90A5427546A1CAFE6440618FD3AF2C8A3362AB7BC7544600CA77BED41F95D8038A8A7CC458177691474',    'oracle');

SELECT 
listagg(seq_num,',') within group(order by seq_num) seq_num,
hash_val, COUNT(hash_val)
FROM table_z
GROUP BY hash_val
HAVING COUNT(hash_val) > 1;

SEQ_NUM    HASH_VAL    COUNT(HASH_VAL)
1,6,7    4714870AFF6C97CA09D135834FDB58A6389A50C11FEF8EC4AFEF466FB60A23AC6B7A9C92658F14DF4993D6B40A4E4D8424196AFC347E97640D68DE61E1CF14B0    3
8,9    6522DA2F3FE4F163D52ACEF62440C086BE5EC1203C2CE90A5427546A1CAFE6440618FD3AF2C8A3362AB7BC7544600CA77BED41F95D8038A8A7CC458177691474    2


Solution

  • You don't really need a loop, you can do this with an UPDATE. I usually see it done like this, where you pick which duplicate you don't want to change, usually the min or max of a series, and change all the ones greater/less than that reference.

    update table_z
    set clob_val = clob_val || (systimestamp +seq_num/10000)
    where exists -- only update if there's a duplicate with a lower seq_num
      (select 1 from table_z min_z 
       where min_z.hash_val = table_z.hash_val 
       and min_z.seq_num < table_z.seq_num)
    

    I added +seq_num/10000 to the timestamp so it'll add a few seconds for each row, so that none of the clobs in each series get the same timestamp.