python grouping networkx multilabel-classification

Creating a unique id through indirect relation python

My question is that creating a uniqe code for the data which has direct relation, and through that direct relation other indirect relation is created for the first and last one. All in all if there such a case I want to assign a unique for the one which has direct and indirect realtion. Below you may find a basic dataset.

Product References  Unique Key
A   1   A1
A   2   A1
A   3   A1
A   a   A1
A   b   A1
A   c   A1
B   1   A1
B   4   A1
B   5   A1
B   x   A1
B   y   A1
B   z   A1
C   6   A1
C   7   A1
C   8   A1
C   x   A1
C   f   A1
C   h   A1
D   12  A2
D   13  A2
D   14  A2
D   15  A2
D   16  A2
D   17  A2
D   18  A2
F   13  A2
F   AB  A2
F   FF  A2
F   NB  A2
F   45  A2
F   63  A2
F   100 A2
F   98  A2
F   FGA A2
F   CA  A2

As you may see A has A relation with B but not with C. But B has a relation with C. In this case actually A and C is also related. And the en d of the dat Iwant to assing a key to those products.

And you may consider data consists of millions of row and each time a new relation is found a new iteration should run.

Explain problem, knowledge request, coding and library recommandations. How to create algortihm.

Solution

With networkx, you can use nx.connected_components and map:

import networkx as nx

# create graph
G = nx.from_pandas_edgelist(df, source='Product', target='References')

# find connected components, map unique key per group
keys = {n: f'A{i}' for i, s in enumerate(nx.connected_components(G), start=1)
        for n in s}

# map unique key based on product group
df['Unique Key'] = df['Product'].map(keys)

To improve memory efficiency, you can also pre-filter the nodes:

# only keep nodes that connect to 2 products or more
G = nx.from_pandas_edgelist(df[df['References'].duplicated(keep=False)], source='Product', target='References')

products = set(df['Product'])
G.add_nodes_from(products) # ensure all products are in the graph
keys = {n: f'A{i}' for i, s in enumerate(nx.connected_components(G), start=1)
        for n in s&products}

df['Unique Key'] = df['Product'].map(keys)

Output:

   Product References Unique Key
0        A          1         A1
1        A          2         A1
2        A          3         A1
3        A          a         A1
4        A          b         A1
5        A          c         A1
6        B          1         A1
7        B          4         A1
8        B          5         A1
9        B          x         A1
10       B          y         A1
11       B          z         A1
12       C          6         A1
13       C          7         A1
14       C          8         A1
15       C          x         A1
16       C          f         A1
17       C          h         A1
18       D         12         A2
19       D         13         A2
20       D         14         A2
21       D         15         A2
22       D         16         A2
23       D         17         A2
24       D         18         A2
25       F         13         A2
26       F         AB         A2
27       F         FF         A2
28       F         NB         A2
29       F         45         A2
30       F         63         A2
31       F        100         A2
32       F         98         A2
33       F        FGA         A2
34       F         CA         A2

The graph:

Simplified graph in the optimized version: