Search code examples
sqloracle-databaseoracle10gmdm

SQL combine multiple identifiers to create a group id for duplicate records


I'm working on a problem in Oracle that I'm struggling to solve 'elegantly'.

I have a data extract with three different identifiers: A, B, C

Each identifier may appear in more than one row, and each row may have one or more of these three identifiers (i.e the column is populated or null).

I want to be able to group all records that have any combination of either A, B or C in common and assign them the same group id.

Extract table showing what the eventual groups should be:

Rownum | A    | B    | C    | End group
1        p      NULL   NULL   1
2        p      r      NULL   1
3        q      NULL   NULL   2
4        NULL   r      NULL   1
5        NULL   NULL   s      2
6        q      NULL   s      2

My original approach was to assign a guid to each row in the extract and create a lookup table for the three identifiers:

GUID | IDENTIFIER | IDENTIFIER TYPE | GROUP | END GROUP
1      p            A                 1       1
2      p            A                 1       1
2      r            B                 2       1
3      q            A                 3       3
4      r            B                 2       1
5      s            C                 4       3
6      q            A                 3       3
6      s            C                 4       3

Then group by identifier and assign a group number. The groups, however, need to be combined where possible to provide the view shown in end group.

The only solution I can think of for this problem is to use loops, which I'd rather avoid.

Any ideas would be greatly appreciated.

Niall


Solution

  • This is truly an interesting problem. Still, I think we are missing a definition of a "group". Since in your example (p,null,null) (row1) and (null,r,null) (row4) share no common identifier and belong to the same group I'll go with this definition for grouping:

    A row belongs to a group if it shares at least one identifier with at least one row of this group.

    This means we can "chain" rows. This naturally leads to a hierarchical solution:

    SQL> SELECT ID, a, b, c, MIN(grp) grp
      2    FROM (SELECT connect_by_root(id) ID,
      3                 connect_by_root(a) a,
      4                 connect_by_root(b) b,
      5                 connect_by_root(c) c,
      6                 ID grp
      7             FROM a
      8           CONNECT BY NOCYCLE(PRIOR a = a
      9                           OR PRIOR b = b
     10                           OR PRIOR c = c))
     11   GROUP BY ID, a, b, c
     12   ORDER BY ID;
    
            ID A          B          C                 GRP
    ---------- ---------- ---------- ---------- ----------
             1 p                                         1
             2 p          r                              1
             3 q                                         3
             4            r                              1
             5                       s                   3
             6 q                     s                   3
    
    6 rows selected
    

    You can execute the subquery to understand the construction:

    SQL> SELECT connect_by_root(id) ID,
      2         connect_by_root(a) a,
      3         connect_by_root(b) b,
      4         connect_by_root(c) c,
      5         substr(sys_connect_by_path(ID, '->'), 3) path,
      6         ID grp
      7    FROM a
      8  CONNECT BY NOCYCLE(a = PRIOR a
      9                  OR b = PRIOR b
     10                  OR c = PRIOR c);
    
            ID A          B          C          PATH            GRP
    ---------- ---------- ---------- ---------- -------- ----------
             1 p                                1                 1
             1 p                                1->2              2
             1 p                                1->2->4           4
             2 p          r                     2                 2
             2 p          r                     2->1              1
             2 p          r                     2->4              4
             3 q                                3                 3
             3 q                                3->6              6
             3 q                                3->6->5           5
             4            r                     4                 4
             4            r                     4->2              2
             4            r                     4->2->1           1
             5                       s          5                 5
             5                       s          5->6              6
             5                       s          5->6->3           3
             6 q                     s          6                 6
             6 q                     s          6->3              3
             6 q                     s          6->5              5
    
    18 rows selected