Search code examples
sqloracle-databaseoracle11g

Filtering duplicate values, searching for pairs


We have a plugin, which imports an XML file from an FTP server into the database. This XML is further processed, and data is inserted into the database tables. These rows are added to the "pricing" table (with help of the package w 10k+ rows), where prices can be set. My problem is that duplicate values often come through during the import. The system processes them, but during pairing, only the first match is taken into table.

To simplify, there is a table.. the important columns are: the ID, the column "SITE", and the column LINKED, where the ID is attached with an exclamation mark (!) to the PK.

How can I find the pairs and filter out those that don’t have a pair or those that are linked to the same ID more than once?

After this, I would like to make a procedure to update the column to fix the pairing - with the correct ID.

table:

enter image description here

I have colored the pairs to make it easier to visualize and understand what the issue is. Here you can see, that SITE: 292 ID: 4 is not paired to ID:6.

I have no IDEA, how can i link the correct.. i can find the not good one-s, with sub-select:

select * from pricing p where p.PK='3452400012026' 
 and p.site='292' 
 and 
 (select count(*) from pricing po 
  where po.PK = p.PK and 
   po.LINKED=p.PK||'!'||p.ID) != 1

So if it's not 1, it's problematic...

EDIT:

SQL FIDDLE WITH COMMENTS


Solution

  • If I'm not wrong - this could be done with SQL alone using MERGE INTO to do the update. Here is the code with link to the fiddle. Details below the code:

    MERGE INTO PRICING p
    USING ( WITH
                links AS
                    ( Select     p.ID, p.LINKED, p.SITE, p.PK, 
                                 To_Number(Substr(p.LINKED, InStr(p.LINKED, '!') + 1)) "LINK_ID",
                                 Case When p.LINKED Is Not Null
                                      Then Row_Number() Over(Partition By p.SITE, p.PK, To_Number(Substr(p.LINKED, InStr(p.LINKED, '!') + 1))
                                                             Order By p.SITE, p.PK, p.ID) 
                                 End "RN_LINK",
                                 Count(p.ID) Over(Partition By p.SITE, p.PK, To_Number(Substr(p.LINKED, InStr(p.LINKED, '!') + 1))) "COUNT_LINK"
                      From       pricing p
                      Where      p.LINKED Is Not Null
                    ),
                grid AS
                    ( Select   * 
                        From links
                      UNION ALL
                       Select     ID, LINKED, SITE, PK, ID, 
                                  Row_Number() Over(Partition By SITE, PK Order By ID), 
                                  Count(ID) Over()
                       From       pricing
                       Where      LINKED Is Null And
                                  ID Not In(Select LINK_ID From links)
                    )
    --    M a i n   S Q L :
            SELECT   *   
            FROM   ( Select     g.ID, g.LINKED, g.SITE, g.PK, 
                                Case When g.COUNT_LINK > 1 And
                                          g.LINKED Is Not Null And
                                          g.RN_LINK = 1 
                                     Then g.PK || '!' || g.LINK_ID
                                     When g.COUNT_LINK > 1 And 
                                          g.RN_LINK > 1 And 
                                          g.ID != g.LINK_ID
                                     Then g.PK || '!' || ( Select ID From grid Where LINKED Is Null And RN_LINK = g.RN_LINK - 1 )
                               End "LINKED_2"
                     From     grid g
                   )
            WHERE    LINKED_2 Is Not Null And SubStr(LINKED_2, -1) != '!'
        ) x ON (x.ID = p.ID)
    WHEN MATCHED 
    THEN UPDATE SET p.LINKED = x.LINKED_2
    WHERE p.LINKED != x.LINKED_2
    

    See the fiddle and the result here.

    Details:

    1. create a cte (links) to get ID - LINK_ID paires by extracting LINK_ID from LINKED column ...
    --    l i n k s :
    /*
    ID  LINKED           SITE PK            LINK_ID RN_LINK COUNT_LINK
    --- ---------------  ---- ------------- ------- ------- ----------
    1   3452400012026!2  99   3452400012026       2       1          1
    3   3452400012026!8  99   3452400012026       8       1          2
    5   3452400012026!7  99   3452400012026       7       1          1
    6   3452400012026!8  99   3452400012026       8       2          2
    9   3452400012026!10 99   3452400012026      10       1          1      */
    --  here we have the ID - LINK_ID pairs along with row numbers per LINK_ID with total rows per LINK_ID
    --  notice that LINK_ID 8 has two rows - for IDs 3 and 6
    
    1. Create another cte (grid) to union the rows with null in LINKED column not present in the links cte above
    --    g r i d 
    /*
    ID  LINKED           SITE PK            LINK_ID RN_LINK COUNT_LINK
    --- ---------------  ---- ------------- ------- ------- ----------
    1   3452400012026!2  99   3452400012026       2       1          1
    3   3452400012026!8  99   3452400012026       8       1          2
    4   null             292  3452400012026       4       1          1
    5   3452400012026!7  99   3452400012026       7       1          1
    6   3452400012026!8  99   3452400012026       8       2          2
    9   3452400012026!10 99   3452400012026      10       1          1      */
    
    1. In main sql generate LINKED_2 column for rows with duplicate LINK_IDs (ID 3 and 6) using row not present in links (ID 4):
    • first duplicate row (RN_LINK = 1) will have the LINK_ID as it already has
    • ssecond duplicate row (RN_LINK = 2) will fetch the first not paired ID ffrom grid (4 in this case)
    • if there was third same LINK_ID the correlated subquery would try to fetch the second not paired ID ... and so on ...
    --    m a i n : 
    /*
    ID  LINKED            SITE  PK              LINKED_2
    --  ----------------  ----  --------------  ---------------
     3  3452400012026!8   99    3452400012026   3452400012026!8
     6  3452400012026!8   99    3452400012026   3452400012026!4      */
    

    The resulting dataset is MERGED INTO pricing to do the update.