Search code examples
data-structureslanguage-agnosticsetbit-manipulationsubset

Data structure for subset-reduced growing list


I'm working on a problem which involves going through a lot of data. To reduce the work (because current calculations take about two weeks of compute time, and I'd like to reduce that dramatically) I came up with an algorithm which would be much faster if it was able to avoid a certain type of duplication. (The current algorithm avoids storing this information because it is too large, unreduced, to fit in memory.)

I have a collection of sets, and I don't want to insert a set A if there is already a set B which is a subset of A. At the moment the sets are represented by integers where individual binary digits represent a particular element being present or absent. In that interpretation the set/integer A should not be inserted if there is already a set/integer B such that (~A) & B is 0, where ~ is bitwise negation and & is bitwise AND.

For example, if my collection has the following sets

[ {a,b}, {b,c}, {b,d,e} ]

and I asked to add {b,c,e} it should not be added (since {b,c} is already there) and similarly with {a,b} (since {a,b} is there) but {a,e} should be added.

The numeric equivalent would be starting with

[ `0b11`, `0b110`, `0b11010` ]

where 0b10110 is not added since (~0b10110) & 0b110 == 0, 0b11 is not added since (~0b11) ^ 0b11 == 0, but 0b10001 can be added.

Ideally the structure would prune itself as new sets are added, so if {c} were added all existing sets containing c would be removed. But it's acceptable if it doesn't update in that way as long as I can normalize it to that form in some not-too-expensive way every so often.


Solution

  • This is a well-known problem known as "finding extremal sets"; unfortunately, there is nothing fundamentally faster known than the obvious approach of testing a newly inserted set against all existing sets, but good heuristic improvements exist. Here is a recent paper discussing this problem: https://arxiv.org/abs/1508.01753

    An open-source implementation of a related algorithm: https://code.google.com/archive/p/google-extremal-sets/