Search code examples
pythonpandasgroup-by

Pandas: re-assign values for groups to same value in one column if they have at least one common in another


I have following dataframe:

df = pd.DataFrame({"zip":['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A','C', 'C', 'C','C', 'C', 'C'],
                   "zip_splitted":['A_1', 'A_1', 'A_1','A_1', 'A_1', 'A_1', 
                                   'A_2', 'A_2', 'A_2','A_2', 'A_2', 'A_2', 
                                   'C_1', 'C_1', 'C_1', 'C_1', 'C_1', 'C_1'], 
                   "cluster":['111', '111', '111', '112', '112', '112', 
                              '113', '113', '113', '114', '114', '114', 
                              '115', '115', '115', '116', '116', '116'],                  
                   "cluster2":['991', '991', '994', '991', '882', '991', 
                               '993', '991', '994', '992', '991', '991',
                              '889', '889', '992', '998', '997', '999']
                  })
zip zip_splitted cluster cluster2
A A_1 111 991
A A_1 111 991
A A_1 111 994
A A_1 112 991
A A_1 112 882
A A_1 112 991
A A_2 113 993
A A_2 113 991
A A_2 113 994
A A_2 114 992
A A_2 114 991
A A_2 114 991
C C_1 115 889
C C_1 115 889
C C_1 115 992
C C_1 116 998
C C_1 116 997
C C_1 116 999

Main target is to re-assign values in 'cluster', so that if groups of values in 'cluster' have at least 1 same value in 'cluster2' value - they should be combined with same cluster id.

For current case output should be following (cluster 111,112,113,114,115,116 - have at least one common vlaue in cluster2, re-assign them to 111, 116 - keeps the same):

zip zip_splitted cluster cluster2 cluster_new
A A_1 111 991 111
A A_1 111 991 111
A A_1 111 994 111
A A_1 112 991 111
A A_1 112 882 111
A A_1 112 991 111
A A_2 113 993 111
A A_2 113 991 111
A A_2 113 994 111
A A_2 114 992 111
A A_2 114 991 111
A A_2 114 991 111
C C_1 115 889 111
C C_1 115 889 111
C C_1 115 992 111
C C_1 116 998 116
C C_1 116 997 116
C C_1 116 999 116

Currently stuck with groupbys, trying to create a list for mapping, but not sure it's a correct way.
Appreciate any help.


Solution

  • Try this:

    import networkx as nx
    
    L = df.groupby('cluster')['cluster2'].agg(list)
    G=nx.Graph()
    for l in L:
        nx.add_path(G, l)
    df = (df.assign(cluster_new = df.groupby(df['cluster2'].map(
        {v2:k for k,v in enumerate(list(nx.connected_components(G))) for v2 in v}
        ))['cluster'].transform('first')))
    

    Output:

       zip zip_splitted cluster cluster2 cluster_new
    0    A          A_1     111      991         111
    1    A          A_1     111      991         111
    2    A          A_1     111      994         111
    3    A          A_1     112      991         111
    4    A          A_1     112      882         111
    5    A          A_1     112      991         111
    6    A          A_2     113      993         111
    7    A          A_2     113      991         111
    8    A          A_2     113      994         111
    9    A          A_2     114      992         111
    10   A          A_2     114      991         111
    11   A          A_2     114      991         111
    12   C          C_1     115      889         111
    13   C          C_1     115      889         111
    14   C          C_1     115      992         111
    15   C          C_1     116      998         116
    16   C          C_1     116      997         116
    17   C          C_1     116      999         116