My dataframe looks like this:
v1 v2 distance
0 be belong 0.666667
4 increase decrease 0.666667
9 analyze assay 0.666667
11 bespeak circulate 0.769231
21 induce generate 0.800000
24 decrease delay 0.750000
26 cause trip 0.666667
27 isolate distinguish 0.750000
28 give infect 0.666667
29 result prove 0.800000
31 describe explain 0.714286
33 report circulate 0.666667
36 affect expose 0.666667
40 explain intercede 0.705882
41 suppress restrict 0.833333
Here v1 and v2 are verbs and distance is their similarity. I want to create clusters of similar words, based on their appearance in the dataframe.
For example, the word circulate appears to be similar to both bespeak and report, so I would like to have a cluster of these 3 words. Groupby doesn't help since they are string values. Can someone help?
This seems like a graph problem. You could try to use networkx:
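import networkx as nx

# Each row becomes an edge between the words in v1 and v2
G = nx.from_pandas_edgelist(df, 'v1', 'v2')
# connected_components returns a generator, so materialize it as a list of sets
clusters = list(nx.connected_components(G))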
output:
[{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'},
{'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'},
{'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'},
{'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}]
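To make it clearer what the connected-components step above is doing, here is a minimal sketch of the same grouping in plain Python using a union-find structure. It is only an illustration, not how networkx implements it: the function name cluster_pairs and the hard-coded pairs list are assumptions made for this example, and only a handful of rows from the question are included.

```python
def cluster_pairs(pairs):
    """Group words so that two words share a cluster if a chain of pairs links them."""
    parent = {}

    def find(word):
        # Follow parent links to the cluster representative, shortening the path as we go.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in pairs:
        # Merge the cluster containing a into the cluster containing b.
        parent[find(a)] = find(b)

    clusters = {}
    for word in list(parent):
        clusters.setdefault(find(word), set()).add(word)
    return list(clusters.values())


# A few rows from the question (hypothetical subset for illustration).
# With the dataframe at hand, zip(df['v1'], df['v2']) would yield the same kind of pairs.
pairs = [
    ("be", "belong"),
    ("increase", "decrease"),
    ("bespeak", "circulate"),
    ("decrease", "delay"),
    ("report", "circulate"),
]
clusters = cluster_pairs(pairs)
print(clusters)
```

Note that, like the networkx version, this ignores the distance column: any pair present in the data links its two words, whatever their similarity score.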
As a graph:
Small function to plot the graph in jupyter:
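def nxplot(G):
    # Requires pygraphviz (and Graphviz) to be installed
    from networkx.drawing.nx_agraph import to_agraph
    from IPython.display import Image
    A = to_agraph(G)
    A.layout('dot')
    # Render to a PNG file, then display it inline in the notebook
    A.draw('/tmp/graph.png')
    return Image(filename='/tmp/graph.png')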