python python-3.x pandas algorithm cluster-analysis

How to create clusters/groups from known associations?


I have a dataframe with two columns: [ID, ASSOCIATED_ID]. For each ID, ASSOCIATED_ID holds a list of the other IDs in the dataframe that it is associated with. Here is a synthesized version of it:

ID            ASSOCIATED_ID
1             [2,3]
2             [1,4]
3             [1]
4             [2]
5             []

I want to create clusters (groups) of IDs that are associated with each other, either directly or through any chain of transitive associations. How can I do that programmatically?


Solution

  • IIUC, you can use networkx and its connected_components function:

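    import networkx as nx

    # One row per (ID, ASSOCIATED_ID) pair; an empty list becomes a NaN row
    df_e = df.explode('ASSOCIATED_ID')

    # Treat each row as an undirected edge between the two IDs
    G = nx.from_pandas_edgelist(df_e, 'ID', 'ASSOCIATED_ID')

    # Each connected component is one cluster
    list(nx.connected_components(G))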
    

    Output:

    [{1, 2, 3, 4}, {nan, 5}]
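
    The nan in the second cluster comes from ID 5: its empty list explodes to a NaN row, which then becomes an edge between 5 and NaN. If you'd rather have 5 as a cluster on its own, one possible variant (a minimal sketch, assuming the sample data from the question; the names edges, clusters, cluster_of and the CLUSTER column are my own, not from the original post) is to drop those rows and add every ID as a node explicitly:

```python
import networkx as nx
import pandas as pd

# Sample data from the question; ID 5 has no associations.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "ASSOCIATED_ID": [[2, 3], [1, 4], [1], [2], []],
})

# One row per pair, then drop the NaN rows left behind by empty lists.
edges = df.explode("ASSOCIATED_ID").dropna(subset=["ASSOCIATED_ID"])

G = nx.Graph()
G.add_nodes_from(df["ID"])  # keeps isolated IDs such as 5 as their own cluster
G.add_edges_from(zip(edges["ID"], edges["ASSOCIATED_ID"]))

clusters = [set(c) for c in nx.connected_components(G)]
# clusters -> [{1, 2, 3, 4}, {5}]

# Optional: label each row with the index of its cluster (hypothetical column name).
cluster_of = {node: i for i, c in enumerate(clusters) for node in c}
df["CLUSTER"] = df["ID"].map(cluster_of)
```

    Adding the nodes before the edges matters here: without add_nodes_from, ID 5 would disappear from the graph entirely once its NaN row is dropped.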