Search code examples
algorithmoptimizationocamlintersectionset-intersection

Fast intersection of two N^4 sets


I'm looking for an algorithm that returns (the length of) the intersection of two given N^4 lists / sets (more precisely in [|0;7|]^4). I'm not expecting an answer in a specific programming language (I'm doing it in OCaml but as this language isn't very popular I don't expect answers to be written in Caml...). Note that solutions involving big pre-computations (for example to completly change the structure of sets) aren't a problem in my case.

I tried to consider N^4 sets as classical integer sets to then intersect them using buit-in set intersections (but this was way too slow for my purpose) and I also tried to consider them as N^2 sets to apply the Lebesgue fast N^2 intersect (this could have worked but it turns out points don't rearrange very randomly over the plane thus making this algorithm quite inefficient).


Solution

  • It is not clear what performance target you are aiming for. Normal set intersection with at most 4096 elements should be not that slow at least for an intersection on generic sets. Typically, the worst case scenario (intersecting the full set with itself) takes around 70µs with OCaml standard set on my computer whereas the best case scenario (intersecting two empty sets) takes 6ns. In particular, computing only the length in O(n) and avoiding to build the result set doesn't seem to improve the performance in the worst case scenario.

    Thus, we are looking at optimizing constants by fitting the algorithm to the data set. Since, your sets have at most 4096 elements, they can be represented as strings of length 4096/8 = 512. With this representation, intersecting two sets is a matter of taking the logical and of the two strings. We can optimize further by computing the length on the fly:

     let inter x y =
        let r = ref 0 in
        for i = 0 to set_size - 1 do
          r := !r + popcount (Char.code x.[i] land Char.code y.[i])
        done;
        !r
    

    This is enough to decrease the computation time to 200 ns with an OCaml implementation of popcount (which counts the number of 1 bit in the character).

    Going further is possible, but at that point we need to either exploit more structure of the data set or we need to fit the algorithm to the hardware by going down to C in order to use the hardware popcount instruction and vectorial SIMD instructions to compute the logical and on larger batch of bytes.