Search code examples
pythonpandasgraphgroup-byconnected-components

Pandas: Assign same cluster id to records based on common groups in different columns


I have a dataframe like this:

index A B C
1 111 222 111
2 111 222 222
3 111 111 555
4 222 222 444
5 222 333 111
6 222 444 333
7 333 555 777
8 444 666 777
df = pd.DataFrame({
  'A': [111,111,111,222,222,222,333,444], 
  'B': [222,222,111,222,333,444,555,666],
  'C': [111,222,555,444,111,333,777,777]
})

I want to create new column 'cluster' and assign same id to records which are connected directly or through common group in one of the columns.
Meaning, here for example, we see that first 3 elements connected by same group in 'A', but they also connected to other records which have same groups '222', '111' in column 'B'. And all records which have '111', '222', '555' in column 'C'.
So basically, all first 6 elements should have same cluster Id.

index A B C cluster
1 111 222 111 1
2 111 222 222 1
3 111 111 555 1
4 222 222 444 1
5 222 333 111 1
6 222 444 333 1
7 333 555 777 2
8 444 666 777 2

Records 4-6 are connected to 1-3 as they form a group in column A and they are connected to previous records through columns B and C.

I was playing with multiple consequent apply functions on pairs of columns, but now thinking of applying connected components here, but can't figure out how to do that.

Also, the main problem is that this dataset is huge, > 30 000 000 records.

Appreciate any help.


Solution

  • You can indeed achieve this using networkx.connected_components, after reshaping your DataFrame to build the consecutive pairs of edges. This won't be so fast with 33M rows though:

    import networkx as nx
    
    tmp = (df
       .melt(ignore_index=False, value_name='source')
       .assign(target=lambda d: d.groupby(level=0)['source'].shift())
       .dropna(subset=['source', 'target'])
       .drop_duplicates(subset=['source', 'target'])
    )
    
    G = nx.from_pandas_edgelist(tmp)
    
    groups = {n: next(iter(g)) for g in nx.connected_components(G) for n in g}
    
    df['cluster'] = df['A'].map(groups)
    

    Output:

         A   B    C cluster
    0  111  02  001     002
    1  111  02  002     002
    2  111  01  005     002
    3  222  02  004     002
    4  222  03  001     002
    5  222  04  003     002
    

    Graph:

    connected components

    considering identical values in different columns as different nodes:

    There are several options, an easy one is to use strings and to prefix the column name in the node

    tmp1 = df.astype(str).radd([f'{c}|' for c in df])
    
    tmp = (tmp1
       .melt(ignore_index=False, value_name='source')
       .assign(target=lambda d: d.groupby(level=0)['source'].shift())
       .dropna(subset=['source', 'target'])
       .drop_duplicates(subset=['source', 'target'])
    )
    
    G = nx.from_pandas_edgelist(tmp)
    
    groups = {n: i for i, g in enumerate(nx.connected_components(G), start=1)
              for n in g}
    
    df['cluster'] = tmp1['A'].map(groups)
    

    Output:

         A    B    C  cluster
    0  111  222  111        1
    1  111  222  222        1
    2  111  111  555        1
    3  222  222  444        1
    4  222  333  111        1
    5  222  444  333        1
    6  333  555  777        2
    7  444  666  777        2
    

    Graph:

    enter image description here