I have following dataframe:
df = pd.DataFrame({"zip":['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A','C', 'C', 'C','C', 'C', 'C'],
"zip_splitted":['A_1', 'A_1', 'A_1','A_1', 'A_1', 'A_1',
'A_2', 'A_2', 'A_2','A_2', 'A_2', 'A_2',
'C_1', 'C_1', 'C_1', 'C_1', 'C_1', 'C_1'],
"cluster":['111', '111', '111', '112', '112', '112',
'113', '113', '113', '114', '114', '114',
'115', '115', '115', '116', '116', '116'],
"cluster2":['991', '991', '994', '991', '882', '991',
'993', '991', '994', '992', '991', '991',
'889', '889', '992', '998', '997', '999']
})
zip | zip_splitted | cluster | cluster2 |
---|---|---|---|
A | A_1 | 111 | 991 |
A | A_1 | 111 | 991 |
A | A_1 | 111 | 994 |
A | A_1 | 112 | 991 |
A | A_1 | 112 | 882 |
A | A_1 | 112 | 991 |
A | A_2 | 113 | 993 |
A | A_2 | 113 | 991 |
A | A_2 | 113 | 994 |
A | A_2 | 114 | 992 |
A | A_2 | 114 | 991 |
A | A_2 | 114 | 991 |
C | C_1 | 115 | 889 |
C | C_1 | 115 | 889 |
C | C_1 | 115 | 992 |
C | C_1 | 116 | 998 |
C | C_1 | 116 | 997 |
C | C_1 | 116 | 999 |
Main target is to re-assign values in 'cluster', so that if groups of values in 'cluster' have at least 1 same value in 'cluster2' value - they should be combined with same cluster id.
For current case output should be following (cluster 111,112,113,114,115,116 - have at least one common vlaue in cluster2, re-assign them to 111, 116 - keeps the same):
zip | zip_splitted | cluster | cluster2 | cluster_new |
---|---|---|---|---|
A | A_1 | 111 | 991 | 111 |
A | A_1 | 111 | 991 | 111 |
A | A_1 | 111 | 994 | 111 |
A | A_1 | 112 | 991 | 111 |
A | A_1 | 112 | 882 | 111 |
A | A_1 | 112 | 991 | 111 |
A | A_2 | 113 | 993 | 111 |
A | A_2 | 113 | 991 | 111 |
A | A_2 | 113 | 994 | 111 |
A | A_2 | 114 | 992 | 111 |
A | A_2 | 114 | 991 | 111 |
A | A_2 | 114 | 991 | 111 |
C | C_1 | 115 | 889 | 111 |
C | C_1 | 115 | 889 | 111 |
C | C_1 | 115 | 992 | 111 |
C | C_1 | 116 | 998 | 116 |
C | C_1 | 116 | 997 | 116 |
C | C_1 | 116 | 999 | 116 |
Currently stuck with groupbys, trying to create a list for mapping, but not sure it's a correct way.
Appreciate any help.
Try this:
import networkx as nx
L = df.groupby('cluster')['cluster2'].agg(list)
G=nx.Graph()
for l in L:
nx.add_path(G, l)
df = (df.assign(cluster_new = df.groupby(df['cluster2'].map(
{v2:k for k,v in enumerate(list(nx.connected_components(G))) for v2 in v}
))['cluster'].transform('first')))
Output:
zip zip_splitted cluster cluster2 cluster_new
0 A A_1 111 991 111
1 A A_1 111 991 111
2 A A_1 111 994 111
3 A A_1 112 991 111
4 A A_1 112 882 111
5 A A_1 112 991 111
6 A A_2 113 993 111
7 A A_2 113 991 111
8 A A_2 113 994 111
9 A A_2 114 992 111
10 A A_2 114 991 111
11 A A_2 114 991 111
12 C C_1 115 889 111
13 C C_1 115 889 111
14 C C_1 115 992 111
15 C C_1 116 998 116
16 C C_1 116 997 116
17 C C_1 116 999 116