Search code examples
oracle-databaseoracle11goracle19c

Remove duplicate with Merge in Oracle


I have a table without a primary key where more than 13,000 records are duplicates on the OBJID column, but the values in the other columns are different. I want to merge both duplicate records into a single one. Is it possible to do this with a merge statement or do I need to update the table with a loop and fix it manually?

Here is how the table looks: enter image description here

OBJID,INVKLASSE,NETTTYPE,NAVN,ALT_NAVN,ON_ID,ON_TYPE,ON_STED_NR,ON_OVER_ID -219468,NYINVESTERING,,,,,,, -219468,,,,,277615,50,277614,-99 -218981,NYINVESTERING,,,,,,, -218981,,,,,277335,50,277334,-99 -218761,NYINVESTERING,,,,,,, -218761,,,,,277909,50,277908,-99 -218709,NYINVESTERING,,,,,,, -218709,,,,,284646,50,284645,-99

I have created a temporary table and I am trying to insert all duplicate records where OBJID is the primary key.

select * from (
select
   objid,INVKLASSE,NETTTYPE,NAVN,ALT_NAVN,ON_ID,ON_TYPE,ON_STED_NR,ON_OVER_ID,
   row_number() over(partition by objid order by objid) as DUP
from USEROPENNIS
)
where dup=2;

I have created some SQL like this to separate all duplicate records where DUP=1 or 2. All DUP=1 can be inserted as unique records, but DUP=2 is 13,000 records where I need to update the table where all columns are null or something similar.

I'm looking for suggestions on anything that can quickly fix it.

Expected output. enter image description here


Solution

  • To collapse duplicate records into one while picking column values from each, you will need to use a GROUP BY on your primary key combined with aggregate functions (like MAX) around all your non-key columns, with possible DECODE/CASE inside each aggregate that try to decide which values to prioritize.

    If your only issue is some rows have NULL for all the columns and other rows have data, then delete the rows with NULL values. Or if it's mixed but still only a NULL vs. non-NULL issue, MAX by itself will pick out the non-NULL value. If it's more complicated than this and you have multiple non-NULL values to choose from, you have to write the logic to pick the one you want.

    At its simplest:

    SELECT objid,
           MAX(invklasse),MAX(nettype),MAX(navn), etc...
      FROM USEROPENNIS
    GROUP BY objid
    

    More complex (this can look like anything, this just gives the general idea):

    SELECT objid,
           MAX(NULLIF(invklasse,'BADVALUE')),
           SUBSTR(MIN(DECODE(nettype,'nicevalue','1','lessnicevalue','2','3')||nettype),2) nettype, etc..
      FROM USEROPENNIS
    GROUP BY objid
    

    Once you get the data the way you want it in your SELECT query, you place the query you come up with in a CTAS to create a temp table:

    CREATE TABLE tmp$1 AS SELECT ...
    

    Now truncate the original table and insert into it the contents of your temporary table:

    TRUNCATE TABLE useropennis;
    INSERT /*+ append */ INTO useropennis SELECT * FROM tmp$1;
    COMMIT;
    

    Now create a PK on it so you don't get into this situation again:

    ALTER TABLE useropennis ADD CONSTRAINT pk_useropennis PRIMARY KEY (objid);