Search code examples
sqloraclesplitregexp-replaceregexp-substr

Comparing comma separated values from two columns of two different tables


I want to compare the values of two columns (diff table) having comma separated values of two different Oracle tables. I want to find rows that match with all values (NAME1 all values should match with NAME2 values).

Note: The comma separated values are in different order.

Example:

T1:

ID_T1             NAME1
===================================


1      ASCORBIC ACID, PARACETAMOL, POTASSIUM HYDROGEN CARBONATE
2      SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID
3      CAFFEINE, PARACETAMOL PH. EUR.
4      PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE

T2:

ID_T2          NAME2
=================================

 4      POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
 5      SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS
 6      PARACETAMOL PH. EUR.,CAFFEINE
 7      CODEINE PHOSPHATE, PARACETAMOL DC
 8      DEXCHLORPHENIRAMINE MALEATE, DEXTROMETHORPHAN HYDROBROMIDE 
10      DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE

MY RESULT should only show the matching row based on ALL NAME Matches in both tables.

    ID_T1    ID_T2    MATCHING NAME
    ==================================
    1            4    POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
    3            6    PARACETAMOL PH. EUR.,CAFFEINE
    4           10    PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE

Tried with REGEXP_SUBST but wasn't able to make it work.

I used the below code to parse the values:

SELECT REGEXP_SUBSTR (NAME1, '[^,]+', 1, ROWNUM)
            FROM T1
           CONNECT BY ROWNUM <= LENGTH (NAME1) - 
           LENGTH (REPLACE (NAME, ',')) + 1

Solution

  • You could get the table(s) into first normal form and then compare the compounds that are stored in each row. A starting point could be:

    {1} Tokenize each row, and write the tokens into a new table. Give each token its original ID plus a 3-letter prefix, indicating which table the token came from. {2} Group the rows of the new ("normalized") table by ID, and perform a LISTAGG(). Perform a self join, and find matching "token groups".

    {1} Tokenize, create table as select (CTAS)

    create table tokens
    as 
    select
      ltrim(        -- ltrim() and rtrim() remove leading/trailing spaces (blanks)
        rtrim( 
          substr( N.wrapped
          , instr( N.wrapped, ',', 1, T.pos ) + 1
          , ( instr( N.wrapped, ',', 1, T.pos + 1 ) - instr( N.wrapped, ',', 1, T.pos ) ) - 1 
          ) 
        )
      ) token
    , N.id
    from (        
      select ',' || name1 || ',' as wrapped, 'T1_' || to_char( id_t1 ) as id from t1 -- names wrapped in commas, (table)_id
      union all
      select ',' || name2 || ',' , 'T2_' || to_char( id_t2 ) from t2  
    ) N join (  
      select level as pos   -- (max) possible position of char in an existing token
      from dual 
      connect by level <= (
        select greatest(    -- find the longest string ie max position (query T1 and T2) 
          ( select max( length( name1 ) ) from t1 )
        , ( select max( length( name2 ) ) from t2 )
        ) as pos
        from dual
      )  
    ) T
      on T.pos <= ( length( N.wrapped ) - length( replace( N.wrapped, ',') ) ) - 1 
    ;
    

    The inspiration to tokenize without using CONNECT BY came from this SO answer.

    The contents of the TOKENS table will look something like this:

    SQL> select * from tokens ;
    TOKEN                           ID       
    ASCORBIC ACID                   T1_1     
    SODIUM HYDROGEN CARBONATE       T1_2     
    CAFFEINE                        T1_3     
    PSEUDOEPHEDRINE HYDROCHLORIDE   T1_4     
    PARACETAMOL                     T1_100   
    sodium hydroxide                T1_110   
    POTASSIUM HYDROGEN CARBONATE    T2_4     
    SODIUM HYDROGEN CARBONATE       T2_5     
    PARACETAMOL PH. EUR.            T2_6     
    CODEINE PHOSPHATE               T2_7     
    DEXCHLORPHENIRAMINE MALEATE     T2_8     
    DEXCHLORPHENIRAMINE MALEATE     T2_10    
    PARACETAMOL                     T2_200 
    ...
    

    {2} GROUP BY, LISTAGG, self join

    select
      S1.id id1
    , S2.id id2
    , S1.tokengroup_T1
    , S2.tokengroup_T2
    from 
    (
      select substr( id, 4, length( id ) - 3 ) id
      , listagg( token, ' + ' ) within group ( order by token ) tokengroup_T1
      from tokens
      group by id 
      having substr( id, 1, 3 ) = 'T1_'
    ) S1 
      join 
    (
      select substr( id, 4, length( id ) - 3 ) id
      , listagg( token, ' + ' ) within group ( order by token ) tokengroup_T2
      from tokens
      group by id 
      having substr( id, 1, 3 ) = 'T2_'
    ) S2 
      on S1.tokengroup_T1 = S2.tokengroup_T2
    ;
    
    -- result
    ID1   ID2   TOKENGROUP_T1                                                 TOKENGROUP_T2                                                 
    4     10    DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE   DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE   
    110   210   potassium carbonate + sodium hydroxide                        potassium carbonate + sodium hydroxide                        
    1     4     ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE    ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE    
    3     6     CAFFEINE + PARACETAMOL PH. EUR.                               CAFFEINE + PARACETAMOL PH. EUR. 
    

    When doing things this way, you can get the substances into (alphabetical) order, and you can also pick a "delimiter" that you like (we have used '+') here.

    ALTERNATIVE

    If all that is no use to you, or you think this is too complicated, then you could try using TRANSLATE(). In this case, I'd recommend stripping all spaces/blanks from your dataset (in a query - not altering the original data!) like so:

    Query

    select 
      id1, id2
    , name1, name2
    from (
      select 
        id_t1 id1
      , id_t2 id2
      , T1.name1 name1
      , T2.name2 name2
      from T1
        join T2 
          on  translate( replace( T1.name1, ' ', '' ), replace( T2.name2, ' ', '' ), '!' )
            = translate( replace( T2.name2, ' ', '' ), replace( T1.name1, ' ', '' ), '!' )
    ) ;
    

    Result

      ID1   ID2 NAME1                                                                NAME2                                                        
        2     5 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID   SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS        
        3     6 CAFFEINE, PARACETAMOL PH. EUR.                                       PARACETAMOL PH. EUR.,CAFFEINE                                
      100    10 PARACETAMOL, DEXTROMETHORPHAN, PSEUDOEPHEDRINE, PYRILAMINE           DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE   
      110   210 sodium hydroxide, potassium carbonate                                sodium hydroxide, potassium carbonate
    

    NOTE: I've added the following rows to your sample data:

    -- T1
    110, 'sodium hydroxide, potassium carbonate'
    
    -- T2
    210, 'sodium hydroxide, potassium carbonate' 
    211, 'potassium hydroxide, sodium carbonate'
    

    I found that it is easy to use TRANSLATE() in a way that gives you "false positives" ie the substances with ids 110, 210, and 211 will appear to "match". (In other words: I don't think this is the right tool for this job.)

    DBFIDDLE here

    (follow the link to see the sample tables and queries).