Given a list of objects with multiple attributes, I need to find the list of sets created by a union of all intersecting subsets.
Specifically these are Person objects, each with many attributes. I need to create a list of 'master' sets based on a handful of unique identifiers such as SSN, DLN, etc.
For instance, if Person A and Person B have the same SSN, they create a set i. Then if Person B and C have the same DLN, they create a set ii. Persons D and E have the same SSN, but it (and all their other identifiers) does not match any of the identifiers of Persons A, B or C. After merging all intersecting subsets I would end up with one set with Persons A, B, C and another set with Persons D, E.
Here is the pseudo-code for my solution. I am curious if anyone has already come up with a more efficient way of merging all possible intersecting sets. Keep in mind that the links between sets could be X Persons long (i.e. A matches B by SSN and B matches C by DLN and C matches D by SSN and D matches E by some other identifier would result in Persons A-E in one set). Also assume that the language this will be implemented in supports set operations.
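    bigSetList = array of all of the unique Sets
    fullyTested = false
    while (bigSetList.size() > 1) and (fullyTested is false)
        merged = false
        foreach thisSet in bigSetList order by size desc
            if count(sets that intersect with thisSet) > 0
                newThisSet = thisSet
                intersectingSets = []
                foreach testSet in bigSetList
                    if (testSet is not thisSet) and thisSet.intersects(testSet)
                        newThisSet = newThisSet.union(testSet)
                        intersectingSets.push(testSet)
                    end if
                end foreach
                remove thisSet and every set in intersectingSets from bigSetList
                bigSetList.push(newThisSet)
                merged = true
                break // the list changed, so restart the scan
            end if
        end foreach
        if merged is false
            fullyTested = true // have looped through every set in the list and found 0 intersect partners
        end if
    end while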
To expand on my comment in the original post, you want to create a list of sets where each member of a given set shares at least one attribute with at least one other member of that set.
Naively, this can be solved by finding all pairs that share an attribute and then iteratively merging pairs that have a partner in common. This would be O(N^3): O(N^2) for iterating over the pairs, and up to N separate sets to check for membership.
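As a rough illustration of the naive approach, here is a minimal Python sketch. It assumes each person is a dict and that `keys` lists the identifier fields to compare; the function name `merge_naive` and the field names `ssn` and `dln` are made up for this example, and a missing identifier is represented as `None`.

```python
from itertools import combinations


def merge_naive(people, keys):
    """Merge groups pairwise until no two groups share an identifier value."""
    # Start with every person in a group of their own.
    groups = [[p] for p in people]

    def share_identifier(a, b):
        # Two people match if any identifier is present and equal on both.
        return any(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)

    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if any(share_identifier(a, b) for a in groups[i] for b in groups[j]):
                groups[i].extend(groups[j])  # fold group j into group i
                del groups[j]
                merged = True
                break  # indices changed, so restart the scan
    return groups
```

Every merge restarts the scan over all pairs of groups, which is what makes this approach expensive on large inputs.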
You can also think of this problem as finding the connected components of a graph in which every object and every unique attribute value is a node, and each object is connected to each of its attribute values. Setting up that graph takes time linear in the number of object-attribute pairs, and you could then find the connected components in linear time with a breadth-first or depth-first search.
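The graph formulation could be sketched in Python roughly as follows. This is only one way to do it, under the same assumptions as the problem statement: each person is a dict, `keys` names the identifier fields, and `None` means the identifier is missing. The function name `connected_groups` and the node tuples are illustrative choices, not part of any library.

```python
from collections import defaultdict, deque


def connected_groups(people, keys):
    """Group people who are linked, directly or transitively, by a shared identifier."""
    # Bipartite graph: ("person", index) nodes on one side,
    # ("attr", key, value) nodes on the other.
    graph = defaultdict(set)
    for idx, person in enumerate(people):
        node = ("person", idx)
        graph[node]  # make sure people with no identifiers still get a node
        for k in keys:
            value = person.get(k)
            if value is None:
                continue
            attr = ("attr", k, value)
            graph[node].add(attr)
            graph[attr].add(node)

    seen = set()
    groups = []
    for idx in range(len(people)):
        start = ("person", idx)
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            if node[0] == "person":
                members.append(node[1])
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append([people[i] for i in sorted(members)])
    return groups
```

With the example from the question (A and B share an SSN, B and C share a DLN, D and E share a different SSN), this returns two groups: one with A, B and C, and one with D and E. Each node and each edge is visited once, so the search stays linear in the size of the graph.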