Let's say I have a list of lists of words, for example
[['apple','banana'],
['apple','orange'],
['banana','orange'],
['rice','potatoes','orange'],
['potatoes','rice']]
The set is much bigger. I want to cluster the words so that words that usually occur together end up in the same cluster. In this case the clusters would be ['apple', 'banana', 'orange']
and ['rice', 'potatoes'].
What is the best approach to achieve this kind of clustering?
So, after a lot of Googling around, I figured out that I, in fact, can't use clustering techniques, because I lack feature variables on which to cluster the words. A table that records how often each word occurs together with every other word (in fact the Cartesian product) is really an adjacency matrix, and clustering algorithms don't work well on it.
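To make the co-occurrence table concrete, here is a small sketch (standard library only, no assumptions beyond the example data above) that builds that adjacency matrix from the list of lists:

```python
# Build a word-by-word co-occurrence (adjacency) matrix from the example data.
from itertools import combinations
from collections import Counter

docs = [['apple', 'banana'],
        ['apple', 'orange'],
        ['banana', 'orange'],
        ['rice', 'potatoes', 'orange'],
        ['potatoes', 'rice']]

# Count each unordered pair of words that appears in the same sublist.
counts = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        counts[(a, b)] += 1
        counts[(b, a)] += 1  # keep the matrix symmetric

words = sorted({w for doc in docs for w in doc})
matrix = [[counts[(row, col)] for col in words] for row in words]
```

Each cell holds how often the row word and the column word co-occur; the diagonal stays zero. It is exactly this matrix that standard feature-based clustering handles poorly, which is what motivates the graph view below.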
So, the solution I was looking for is graph community detection. I used the igraph library (specifically its python-igraph wrapper) to find the clusters, and it runs very well and fast.
More information: