python pandas nlp cluster-computing

Pandas dataframe groupby text value that occurs in two columns


My dataframe looks like this:

     v1           v2        distance
0   be          belong      0.666667
4   increase    decrease    0.666667
9   analyze     assay       0.666667
11  bespeak     circulate   0.769231
21  induce      generate    0.800000
24  decrease    delay       0.750000
26  cause       trip        0.666667
27  isolate     distinguish 0.750000
28  give        infect      0.666667
29  result      prove       0.800000
31  describe    explain     0.714286
33  report      circulate   0.666667
36  affect      expose      0.666667
40  explain     intercede   0.705882
41  suppress    restrict    0.833333

Here v1 and v2 are verbs and distance is their similarity score. I want to create clusters of similar words, based on which words appear together in a row of the dataframe.

For example, the word circulate appears to be similar to both bespeak and report, so I would like to have a cluster of these 3 words. Groupby doesn't help, because the shared word can sit in either column and the values are strings. Can someone help?


Solution

  • This seems like a graph problem.

    You could try to use networkx:

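    import networkx as nx
    
    # Treat every (v1, v2) row as an edge between two words
    G = nx.from_pandas_edgelist(df, 'v1', 'v2')
    
    # connected_components returns a generator of sets, so wrap it in list()
    clusters = list(nx.connected_components(G))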
    

    output:

    [{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'},
     {'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'},
     {'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'},
     {'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}]
    

    As a graph:

    [Image: the clusters drawn as a graph]

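    Small function to plot the graph in Jupyter (requires pygraphviz):

    def nxplot(G):
        # to_agraph needs the pygraphviz package
        from networkx.drawing.nx_agraph import to_agraph
        from IPython.display import Image
        A = to_agraph(G)
        A.layout('dot')           # lay out with Graphviz's dot engine
        A.draw('/tmp/graph.png')  # render to a PNG file
        return Image(filename='/tmp/graph.png')  # displayed inline by Jupyter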
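
If you also want the cluster as a column you can group by, one possible follow-up is to map every word to the id of its connected component and write that id back to the dataframe. This is only a sketch: it rebuilds a few rows from the question so it runs on its own, and the names `word_to_cluster`, `cluster`, `n_pairs` and `mean_distance` are made up for illustration.

```python
import networkx as nx
import pandas as pd

# A few rows copied from the question's dataframe (original index kept)
df = pd.DataFrame(
    {
        "v1": ["increase", "bespeak", "decrease", "describe", "report", "explain"],
        "v2": ["decrease", "circulate", "delay", "explain", "circulate", "intercede"],
        "distance": [0.666667, 0.769231, 0.750000, 0.714286, 0.666667, 0.705882],
    },
    index=[4, 11, 24, 31, 33, 40],
)

# Same graph as above, additionally keeping distance as an edge attribute
G = nx.from_pandas_edgelist(df, "v1", "v2", edge_attr="distance")
clusters = list(nx.connected_components(G))

# Hypothetical helper mapping: word -> id of the component it belongs to
word_to_cluster = {
    word: cluster_id
    for cluster_id, words in enumerate(clusters)
    for word in words
}

# v1 and v2 of a row are joined by an edge, so they always share a component;
# looking up v1 alone is enough to label the row
df["cluster"] = df["v1"].map(word_to_cluster)

# Now an ordinary groupby works, because the key is the cluster id
summary = df.groupby("cluster").agg(
    n_pairs=("distance", "size"),
    mean_distance=("distance", "mean"),
)
print(df)
print(summary)
```

Keep in mind that connected components are transitive: increase and delay land in the same cluster only because both are linked to decrease, so a long chain of pairs can merge words that are not directly similar.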