python pandas graph group-by connected-components

Pandas: Assign same cluster id to records based on common groups in different columns

I have a dataframe like this:

index	A	B	C
1	111	222	111
2	111	222	222
3	111	111	555
4	222	222	444
5	222	333	111
6	222	444	333
7	333	555	777
8	444	666	777

df = pd.DataFrame({
  'A': [111,111,111,222,222,222,333,444], 
  'B': [222,222,111,222,333,444,555,666],
  'C': [111,222,555,444,111,333,777,777]
})

I want to create new column 'cluster' and assign same id to records which are connected directly or through common group in one of the columns.
Meaning, here for example, we see that first 3 elements connected by same group in 'A', but they also connected to other records which have same groups '222', '111' in column 'B'. And all records which have '111', '222', '555' in column 'C'.
So basically, all first 6 elements should have same cluster Id.

index	A	B	C	cluster
1	111	222	111	1
2	111	222	222	1
3	111	111	555	1
4	222	222	444	1
5	222	333	111	1
6	222	444	333	1
7	333	555	777	2
8	444	666	777	2

Records 4-6 are connected to 1-3 as they form a group in column A and they are connected to previous records through columns B and C.

I was playing with multiple consequent apply functions on pairs of columns, but now thinking of applying connected components here, but can't figure out how to do that.

Also, the main problem is that this dataset is huge, > 30 000 000 records.

Appreciate any help.

Solution

You can indeed achieve this using networkx.connected_components, after reshaping your DataFrame to build the consecutive pairs of edges. This won't be so fast with 33M rows though:

import networkx as nx

tmp = (df
   .melt(ignore_index=False, value_name='source')
   .assign(target=lambda d: d.groupby(level=0)['source'].shift())
   .dropna(subset=['source', 'target'])
   .drop_duplicates(subset=['source', 'target'])
)

G = nx.from_pandas_edgelist(tmp)

groups = {n: next(iter(g)) for g in nx.connected_components(G) for n in g}

df['cluster'] = df['A'].map(groups)

Output:

     A   B    C cluster
0  111  02  001     002
1  111  02  002     002
2  111  01  005     002
3  222  02  004     002
4  222  03  001     002
5  222  04  003     002

Graph:

considering identical values in different columns as different nodes:

There are several options, an easy one is to use strings and to prefix the column name in the node

tmp1 = df.astype(str).radd([f'{c}|' for c in df])

tmp = (tmp1
   .melt(ignore_index=False, value_name='source')
   .assign(target=lambda d: d.groupby(level=0)['source'].shift())
   .dropna(subset=['source', 'target'])
   .drop_duplicates(subset=['source', 'target'])
)

G = nx.from_pandas_edgelist(tmp)

groups = {n: i for i, g in enumerate(nx.connected_components(G), start=1)
          for n in g}

df['cluster'] = tmp1['A'].map(groups)

Output:

     A    B    C  cluster
0  111  222  111        1
1  111  222  222        1
2  111  111  555        1
3  222  222  444        1
4  222  333  111        1
5  222  444  333        1
6  333  555  777        2
7  444  666  777        2

Graph:

index	A	B	C
1	111	222	111
2	111	222	222
3	111	111	555
4	222	222	444
5	222	333	111
6	222	444	333
7	333	555	777
8	444	666	777

index	A	B	C	cluster
1	111	222	111	1
2	111	222	222	1
3	111	111	555	1
4	222	222	444	1
5	222	333	111	1
6	222	444	333	1
7	333	555	777	2
8	444	666	777	2

index	A	B	C
1	111	222	111
2	111	222	222
3	111	111	555
4	222	222	444
5	222	333	111
6	222	444	333
7	333	555	777
8	444	666	777

index	A	B	C	cluster
1	111	222	111	1
2	111	222	222	1
3	111	111	555	1
4	222	222	444	1
5	222	333	111	1
6	222	444	333	1
7	333	555	777	2
8	444	666	777	2

index	A	B	C
1	111	222	111
2	111	222	222
3	111	111	555
4	222	222	444
5	222	333	111
6	222	444	333
7	333	555	777
8	444	666	777

index	A	B	C	cluster
1	111	222	111	1
2	111	222	222	1
3	111	111	555	1
4	222	222	444	1
5	222	333	111	1
6	222	444	333	1
7	333	555	777	2
8	444	666	777	2