Search code examples
pythongroupingnetworkxmultilabel-classification

Creating a unique id through indirect relation python


My question is that creating a uniqe code for the data which has direct relation, and through that direct relation other indirect relation is created for the first and last one. All in all if there such a case I want to assign a unique for the one which has direct and indirect realtion. Below you may find a basic dataset.

Product References  Unique Key
A   1   A1
A   2   A1
A   3   A1
A   a   A1
A   b   A1
A   c   A1
B   1   A1
B   4   A1
B   5   A1
B   x   A1
B   y   A1
B   z   A1
C   6   A1
C   7   A1
C   8   A1
C   x   A1
C   f   A1
C   h   A1
D   12  A2
D   13  A2
D   14  A2
D   15  A2
D   16  A2
D   17  A2
D   18  A2
F   13  A2
F   AB  A2
F   FF  A2
F   NB  A2
F   45  A2
F   63  A2
F   100 A2
F   98  A2
F   FGA A2
F   CA  A2

As you may see A has A relation with B but not with C. But B has a relation with C. In this case actually A and C is also related. And the en d of the dat Iwant to assing a key to those products.

And you may consider data consists of millions of row and each time a new relation is found a new iteration should run.

Explain problem, knowledge request, coding and library recommandations. How to create algortihm.


Solution

  • With networkx, you can use nx.connected_components and map:

    import networkx as nx
    
    # create graph
    G = nx.from_pandas_edgelist(df, source='Product', target='References')
    
    # find connected components, map unique key per group
    keys = {n: f'A{i}' for i, s in enumerate(nx.connected_components(G), start=1)
            for n in s}
    
    # map unique key based on product group
    df['Unique Key'] = df['Product'].map(keys)
    

    To improve memory efficiency, you can also pre-filter the nodes:

    # only keep nodes that connect to 2 products or more
    G = nx.from_pandas_edgelist(df[df['References'].duplicated(keep=False)], source='Product', target='References')
    
    products = set(df['Product'])
    G.add_nodes_from(products) # ensure all products are in the graph
    keys = {n: f'A{i}' for i, s in enumerate(nx.connected_components(G), start=1)
            for n in s&products}
    
    df['Unique Key'] = df['Product'].map(keys)
    

    Output:

       Product References Unique Key
    0        A          1         A1
    1        A          2         A1
    2        A          3         A1
    3        A          a         A1
    4        A          b         A1
    5        A          c         A1
    6        B          1         A1
    7        B          4         A1
    8        B          5         A1
    9        B          x         A1
    10       B          y         A1
    11       B          z         A1
    12       C          6         A1
    13       C          7         A1
    14       C          8         A1
    15       C          x         A1
    16       C          f         A1
    17       C          h         A1
    18       D         12         A2
    19       D         13         A2
    20       D         14         A2
    21       D         15         A2
    22       D         16         A2
    23       D         17         A2
    24       D         18         A2
    25       F         13         A2
    26       F         AB         A2
    27       F         FF         A2
    28       F         NB         A2
    29       F         45         A2
    30       F         63         A2
    31       F        100         A2
    32       F         98         A2
    33       F        FGA         A2
    34       F         CA         A2
    

    The graph:

    enter image description here

    Simplified graph in the optimized version:

    enter image description here