I need help working with a pandas DataFrame. Here is the table:
Col1 Col2
A B
C B
D B
E F
G F
F A
Z Y
H Y
L P
From this table I would like to create clusters and get a new table such as:
Cluster Names
Cluster1 A
Cluster1 B
Cluster1 C
Cluster1 D
Cluster1 F
Cluster1 E
Cluster1 G
Cluster2 Z
Cluster2 Y
Cluster2 H
Cluster3 L
Cluster3 P
As you can see, the letters A, B, C, D, E, F and G are all in Cluster1 because they are all linked to one another, directly or through a shared letter.
`A` and `B` are in the same row (A and B form `Cluster1`)
`C` and `B` are in the same row (C joins `Cluster1`)
`D` and `B` are in the same row (D joins `Cluster1`)
`F` and `A` are in the same row (F joins `Cluster1`)
`E` and `F` are in the same row (E joins `Cluster1`)
`G` and `F` are in the same row (G joins `Cluster1`)
`Z` and `Y` are in the same row (Z and Y form `Cluster2`)
`H` and `Y` are in the same row (H joins `Cluster2`)
`L` and `P` are in the same row (L and P form `Cluster3`)
Does anyone have an idea how to do this with pandas?
This is a graph problem known as connected components: each row is an edge between two nodes, and each cluster is one connected component. I suggest you use networkx.connected_components:
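import networkx as nx
import pandas as pd

# df is the DataFrame from the question; each row is an edge between Col1 and Col2
g = nx.from_pandas_edgelist(df, source='Col1', target='Col2', create_using=nx.Graph)
# Each connected component of the graph is one cluster
for component in nx.connected_components(g):
    print(component)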
Output
{'E', 'G', 'C', 'D', 'F', 'A', 'B'}
{'Y', 'H', 'Z'}
{'L', 'P'}
Notice that the components match the groups in your expected output (the order of elements within each set is arbitrary and may differ between runs). To convert them to a DataFrame, do the following:
# Number the components from 1 and emit one [cluster, name] row per element
data = [[f'Cluster{i}', element]
        for i, component in enumerate(nx.connected_components(g), 1)
        for element in component]
result = pd.DataFrame(data=data, columns=['Cluster', 'Names'])
print(result)
Output
     Cluster Names
0   Cluster1     D
1   Cluster1     A
2   Cluster1     B
3   Cluster1     G
4   Cluster1     C
5   Cluster1     F
6   Cluster1     E
7   Cluster2     Z
8   Cluster2     Y
9   Cluster2     H
10  Cluster3     L
11  Cluster3     P
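
If you would rather not add networkx as a dependency, the same grouping can be sketched with a small union-find (disjoint-set) structure in plain Python. This is only an illustration, not what networkx does internally; the helper names `find` and `cluster_rows` are made up for this example, and clusters are numbered in order of first appearance in the input.

```python
def find(parent, x):
    """Return the representative (root) of x's cluster."""
    parent.setdefault(x, x)  # a name seen for the first time is its own root
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
        x = parent[x]
    return x


def cluster_rows(edges):
    """Return (cluster, name) rows for names linked by any chain of edges."""
    parent = {}
    for a, b in edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}  # root -> names, in order of first appearance
    for name in parent:
        groups.setdefault(find(parent, name), []).append(name)

    return [(f"Cluster{i}", name)
            for i, names in enumerate(groups.values(), 1)
            for name in names]


# The pairs from the question's table
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("E", "F"), ("G", "F"),
         ("F", "A"), ("Z", "Y"), ("H", "Y"), ("L", "P")]
rows = cluster_rows(edges)
```

Starting from the DataFrame, the edges could be taken with `df[['Col1', 'Col2']].itertuples(index=False, name=None)`, and the rows turned back into a table with `pd.DataFrame(rows, columns=['Cluster', 'Names'])`. Unlike the set-based output above, the order of names here follows the order in which they first appear in the input.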