python, machine-learning, cluster-analysis, information-retrieval

List of lists of words clustering


Let's say I have a list of lists of words, for example

[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes','orange'],
 ['potatoes','rice']]

The real data set is much bigger. I want to cluster the words so that words that usually occur together end up in the same cluster. In this case the clusters would be ['apple', 'banana', 'orange'] and ['rice','potatoes'].
What is the best approach to achieve this kind of clustering?


Solution

  • After a lot of Googling, I figured out that I can't really use standard clustering techniques, because I lack feature variables on which to cluster the words. If I build a table recording how often each word occurs together with every other word (effectively over the Cartesian product of the words), what I get is in fact an adjacency matrix, and clustering doesn't work well on it.

    So the solution I was looking for is graph community detection. I used the igraph library (via its Python wrapper, python-igraph) to find the clusters, and it runs very well and fast.

    More information:
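
To make the approach in this answer concrete, here is a minimal sketch of how it could look. It is not the answerer's actual code: the helper names `cooccurrence_edges` and `cluster_words` are made up for illustration, and the choice of the Louvain method (`community_multilevel`) with co-occurrence counts as edge weights is an assumption, since the answer does not say which igraph algorithm was used.

```python
from collections import Counter
from itertools import combinations

# Example data from the question.
word_lists = [
    ["apple", "banana"],
    ["apple", "orange"],
    ["banana", "orange"],
    ["rice", "potatoes", "orange"],
    ["potatoes", "rice"],
]


def cooccurrence_edges(word_lists):
    """Count how often each unordered pair of words shares a list.

    Hypothetical helper: the result is the weighted edge list of the
    co-occurrence graph (a sparse form of the adjacency matrix).
    """
    counts = Counter()
    for words in word_lists:
        # sorted(set(...)) drops duplicates and gives every pair one
        # canonical order, so ('a', 'b') and ('b', 'a') are not counted apart.
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


def cluster_words(word_lists):
    """Group words by community detection on the co-occurrence graph.

    Assumption: Louvain (community_multilevel) is used here; other igraph
    community methods could be substituted.
    """
    import igraph as ig  # third-party package (python-igraph)

    edges = cooccurrence_edges(word_lists)
    # Each tuple is (source, target, weight); weights=True stores the third
    # element as the "weight" edge attribute. Words that never co-occur
    # with another word produce no edge and so do not appear in the graph.
    graph = ig.Graph.TupleList(
        [(a, b, w) for (a, b), w in edges.items()], weights=True
    )
    communities = graph.community_multilevel(weights="weight")
    # Each community is a list of vertex indices; map them back to words.
    return [sorted(graph.vs[i]["name"] for i in members) for members in communities]
```

Calling `cluster_words(word_lists)` returns a list of word groups. Which groups come out depends on the algorithm and on the edge weights, so it is worth comparing a few of igraph's community detection methods on real data before settling on one.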