I have an array of items (~5000 items, each item is an English word) and a distance function between pairs of items. I want to find groups of items within the array where all the items within a group satisfy a distance criterion (e.g. every pair of items have a distance smaller than 2). The groups should generally be as large as possible, but there's no formal definition or hard requirement for this.
My implementation language is PHP, but I'm looking for general advice regarding algorithms that can handle this efficiently.
Update: I think I can solve this by building a graph where the vertices are the items, and there's an edge between items that satisfy the distance constraint. Once I build the graph I can run an algorithm like Bron–Kerbosch to list all the maximal cliques. I'll update if this works out but feel free to add your thoughts in the meantime.
How is this function defined? Is it pre-computed? If so you can iterate over the computed function representation and retrieve pairs of words based on your criterion. If not, you have no other option but to compute the function between all pairs of words (This is necessary in your graph approach). Instead of the Bron-Kerbosch algorithm, I would go for a randomization+approximation strategy like,
http://dimacs.rutgers.edu/Workshops/Challenge10/abstracts.html#ovelgonne
http://dl.acm.org/citation.cfm?id=1933306.1934471
It is based on an modularity maximization approach. Modularity is the ratio between number of outgoing edges in a cluster to the number of edges within the cluster. In your case, you will be looking for the clusters with ratio 0, and choose the biggest one of them. This algorithm is really fast and works for most datasets that I have worked with. Though a modularity based method might be overkill for this problem, IMHO I feel this is a nice way to think about the problem + the implementation of this algorithm is available online (C implementation by the authors of the paper).