I'm working on a distributed implementation of MinHash LSH on Flink and, as a last step, I need to merge some clusters, identified as sets of mutually similar elements.
So I have a distributed collection of sets as input and I need an algorithm to efficiently merge sets with common elements. Given the computational model of Flink, the algorithm may be iterative and does not necessarily have to be map-reduce like.
Here's an example:
from {#1: {1,2}, #2: {2,3}, #3: {4,5}, #4: {1,27}}
the result should be {1,2,3,27}, {4,5}
because sets #1, #2 and #4 have at least one element in common (they chain together through elements 1 and 2).
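For concreteness, here is the example rendered as plain Python data (not Flink code); the names are just for illustration:

```python
# Input: numbered sets, as in the example above.
sets = {1: {1, 2}, 2: {2, 3}, 3: {4, 5}, 4: {1, 27}}

# Expected output: sets #1, #2 and #4 merge because they chain
# together via shared elements; set #3 stays on its own.
expected = [{1, 2, 3, 27}, {4, 5}]
```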
Here's an idea: Gelly, which is part of Flink, has a connected-components finder. Build a graph with a node for each set element and edges connecting the elements of each set in the simplest possible way, e.g. for {a, b, c, d, ...} add [a,b], [a,c], [a,d], ... Now find the connected components. Their node sets are exactly the merged sets you're looking for.
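A minimal local sketch of this reduction, using a plain union-find in place of Gelly (a Gelly version would build a Graph from the same edge list and run its connected-components algorithm); all names here are illustrative:

```python
def star_edges(s):
    """Connect a set's elements with a star: [a,b], [a,c], [a,d], ..."""
    it = iter(sorted(s))
    hub = next(it)
    return [(hub, x) for x in it]  # note: a singleton set yields no edges

def connected_components(edges):
    """Union-find over the edge list; returns components as sets of nodes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    comps = {}
    for node in list(parent):
        comps.setdefault(find(node), set()).add(node)
    return list(comps.values())

sets = [{1, 2}, {2, 3}, {4, 5}, {1, 27}]
edges = [e for s in sets for e in star_edges(s)]
merged = connected_components(edges)
# merged contains {1, 2, 3, 27} and {4, 5}
```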
Edit If you are worried about the performance impact of converting from sets to graphs and back (although this worry is premature optimization; you should try it), it would be simple enough to re-implement Gelly's token-pushing scheme over sets. Here's how this would work. You already have the tokens in your example: the set numbering. Let S[i] be set i shown in your example. E.g. S[1] = {1,2}. Let R be an inverse multimap that takes each set element to the set of sets it belongs to. E.g. R[2] = {1,2} in your example. Let T[i] be the elements reachable from set i by transitive non-null intersection "links". Then compute:
T[i] = S[i] for all i   // with no links at all, a set reaches its own elements
loop
  for all i: Tnew[i] = \union_{ x \in T[i] } \union_{ j \in R[x] } S[j]   // add the elements of every set that touches T[i]
  exit if Tnew == T
  T = Tnew
end loop
When done, the distinct values of map T are the answer you want. As written, the number of iterations is bounded by the diameter of the implicit intersection graph; substituting T for S inside the union makes the reach double each round, which brings the bound down to O(log |U|), where U is the universe of set elements.
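The pseudocode above can be sketched in plain Python like this (a local sketch of the fixpoint loop, not a Flink job; the function name is made up):

```python
def merge_sets(S):
    """Merge a dict {set_id: set_of_elements} by transitive intersection."""
    # R: inverse multimap taking each element to the ids of sets containing it.
    R = {}
    for i, s in S.items():
        for x in s:
            R.setdefault(x, set()).add(i)

    # T[i] = S[i]: with no links at all, a set reaches its own elements.
    T = {i: set(s) for i, s in S.items()}
    while True:
        # Tnew[i] = union over x in T[i], j in R[x] of S[j].
        Tnew = {i: set().union(*(S[j] for x in t for j in R[x]))
                for i, t in T.items()}
        if Tnew == T:
            break
        T = Tnew

    # The distinct values of T are the merged clusters.
    return [set(t) for t in {frozenset(t) for t in T.values()}]

clusters = merge_sets({1: {1, 2}, 2: {2, 3}, 3: {4, 5}, 4: {1, 27}})
# clusters contains {1, 2, 3, 27} and {4, 5}
```

In a Flink translation, the update step would be a join of T against R followed by a join against S and a grouped union, wrapped in a bulk iteration with the `Tnew == T` test as the convergence criterion.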