python machine-learning cluster-analysis information-retrieval

List of lists of words clustering

Let's say I have a list of lists of words, for example:

[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes','orange'],
 ['potatoes','rice']]

The real data set is much bigger. I want to cluster the words so that words that usually appear together end up in the same cluster. In this case the clusters would be ['apple', 'banana', 'orange'] and ['rice', 'potatoes'].
What is the best approach to achieve this kind of clustering?

Solution

So, after a lot of Googling, I figured out that I can't use standard clustering techniques here, because I lack feature variables on which to cluster the words. If I build a table recording how often each word occurs together with every other word (in effect, over the Cartesian product of the words), what I get is an adjacency matrix, and clustering doesn't work well on that.
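
To make the co-occurrence table concrete, here is a minimal sketch that counts how often each pair of words shows up in the same inner list, using the example data from the question. The names `word_lists` and `cooccurrence` are made up for illustration, and treating each inner list as a set (so repeated words count once) is an assumption.

```python
from collections import Counter
from itertools import combinations

word_lists = [
    ['apple', 'banana'],
    ['apple', 'orange'],
    ['banana', 'orange'],
    ['rice', 'potatoes', 'orange'],
    ['potatoes', 'rice'],
]

# Count how many inner lists contain each unordered pair of words.
cooccurrence = Counter()
for words in word_lists:
    # sorted(set(...)) gives every unordered pair a single canonical key
    for pair in combinations(sorted(set(words)), 2):
        cooccurrence[pair] += 1

for (a, b), count in sorted(cooccurrence.items()):
    print(a, b, count)
```

Each nonzero count is a weighted edge between two words, which is why the table is better read as a graph than as a feature matrix.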

So, the solution I was looking for is graph community detection. I used the igraph library (through its Python wrapper, python-igraph) to find the clusters, and it runs very well and fast.

More information:

Similar question: https://stats.stackexchange.com/questions/142297/finding-natural-groups-clusters-in-an-undirected-graph-over-several-undirect
Community detection in graphs (paper): https://arxiv.org/pdf/0906.0612.pdf
Basic description of various algorithms: What are the differences between community detection algorithms in igraph?