Use pandas to cluster in a dataframe

I need help working with pandas and a table. Here is the table:

Col1    Col2
A   B
C   B
D   B
E   F
G   F
F   A
Z   Y
H   Y
L   P

From this table I would like to create clusters and get a new table such as:

Cluster Names
Cluster1    A
Cluster1    B
Cluster1    C
Cluster1    D
Cluster1    F
Cluster1    E
Cluster1    G
Cluster2    Z
Cluster2    Y
Cluster2    H
Cluster3    L
Cluster3    P

As you can see, the letters A, B, C, D, E, F and G are in Cluster1 because each of them shares a line with at least one other member of the cluster.

`A` and `B` are in the same line (A and B form `Cluster1`)
`C` and `B` are in the same line (C joins `Cluster1`)
`D` and `B` are in the same line (D joins `Cluster1`)
`F` and `A` are in the same line (F joins `Cluster1`)
`E` and `F` are in the same line (E joins `Cluster1`)
`G` and `F` are in the same line (G joins `Cluster1`)

`Z` and `Y` are in the same line (Z and Y create `Cluster2`)
`H` and `Y` are in the same line (H joins `Cluster2`)

`L` and `P` are in the same line (L and P create `Cluster3`)

Does anyone have an idea how to do this using pandas?


  • This is a graph problem known as connected components. I suggest you use networkx.connected_components:

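    import networkx as nx
    import pandas as pd

    # df is the DataFrame from the question, with columns Col1 and Col2
    g = nx.from_pandas_edgelist(df, source='Col1', target='Col2', create_using=nx.Graph)
    for component in nx.connected_components(g):
        print(component)

    Output:
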
    {'E', 'G', 'C', 'D', 'F', 'A', 'B'}
    {'Y', 'H', 'Z'}
    {'L', 'P'}

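    Notice that the components match the groups in your expected output. Sets are unordered, so the order of the elements inside each component may differ from run to run. To convert the components to a DataFrame, do the following: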

    data = [[f'Cluster{i}', element] for i, component in enumerate(nx.connected_components(g), 1) for element in component]
    result = pd.DataFrame(data=data, columns=['Cluster', 'Names'])
    print(result)

    Output:

         Cluster Names
    0   Cluster1     D
    1   Cluster1     A
    2   Cluster1     B
    3   Cluster1     G
    4   Cluster1     C
    5   Cluster1     F
    6   Cluster1     E
    7   Cluster2     Z
    8   Cluster2     Y
    9   Cluster2     H
    10  Cluster3     L
    11  Cluster3     P
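
If adding networkx as a dependency is not an option, the same grouping can be sketched with a small union-find (disjoint-set) structure in plain Python. This is a different technique from the networkx call above, not what the answer uses; the function name `cluster_pairs` and the hard-coded `pairs` list are assumptions made for illustration, with the pairs copied from the table in the question.

```python
def cluster_pairs(pairs):
    """Group names that are linked by any chain of pairs (union-find)."""
    parent = {}

    def find(x):
        # Register unseen names as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the two groups of every pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number clusters in order of first appearance of their members.
    cluster_ids = {}
    rows = []
    for name in list(parent):
        root = find(name)
        cluster_id = cluster_ids.setdefault(root, len(cluster_ids) + 1)
        rows.append((f"Cluster{cluster_id}", name))
    return rows


# Pairs taken from Col1 and Col2 of the table in the question.
pairs = [("A", "B"), ("C", "B"), ("D", "B"), ("E", "F"), ("G", "F"),
         ("F", "A"), ("Z", "Y"), ("H", "Y"), ("L", "P")]

rows = cluster_pairs(pairs)
for cluster, name in rows:
    print(cluster, name)
```

With a DataFrame at hand, the pairs could presumably be taken from `df[['Col1', 'Col2']].itertuples(index=False)` and the result wrapped with `pd.DataFrame(rows, columns=['Cluster', 'Names'])`. Unlike the set-based output above, the rows here follow the order in which names first appear in the input.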