Search code examples
javadata-structuressetguavabloom-filter

Intersections and unions of Bloom filters in Java using the Guava library


Bloom filters seem promising for a real-world problem I'm working on. A popular java implementation seems to be Google's guava library.

I need to use bloom filter's union operation, and an intersection operation would be good to have, but I can work around it.

In the java docs, I could find no method which performs intersection, while the putAll method seems to work like the union operation.

So my questions are these:

  1. Is putAll() the correct way to get the union of two bloom filters?
  2. Is there a non-reflective way to get the intersection of two or more bloom filters?
  3. In case one is okay with reflection, can we safely perform biwise operations (or, and) on the 'data' field?

If someone could recommend any another popular and well-tested library which is available on maven's repository,and has intersection and union, that'd also solve my needs.


Solution

    1. Is putAll() the correct way to get the union of two bloom filters?

    Yes, BloomFilter#putAll(BloomFilter) JavaDoc says:

    Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data. The mutations happen to this instance. Callers must ensure the Bloom filters are appropriately sized to avoid saturating them.

    The catch is that it'll throw IllegalArgumentException - if isCompatible(that) == false, so if you want to use this method you must make sure that isCompatible returns true:

    Determines whether a given Bloom filter is compatible with this Bloom filter. For two Bloom filters to be compatible, they must:

    • not be the same instance
    • have the same number of hash functions
    • have the same bit size
    • have the same strategy
    • have equal funnels

    That said, let's try to answer the second question and find an easier solution.

    1. Is there a non-reflective way to get the intersection of two or more bloom filters?

    Yes, you can use the fact that BloomFilters implement Predicate and use its .and() and .or() methods, remembering that mightContain behavior depends on BloomFilter's parameters.

    Example:

    // bloomFilter1: [0, 100], expected size 200, FPP 1%
    BloomFilter<Integer> bloomFilter1 = IntStream.rangeClosed(0, 100)
            .boxed()
            .collect(toBloomFilter(Funnels.integerFunnel(), 200L, 0.01d));
    
    // bloomFilter2: [80, 200], expected size 200, FPP 1%
    BloomFilter<Integer> bloomFilter2 = IntStream.rangeClosed(80, 200)
            .boxed()
            .collect(toBloomFilter(Funnels.integerFunnel(), 200L, 0.01d));
    
    // bloomFilter2: [80, 200], expected size 1000, FPP 0.1%
    BloomFilter<Integer> bloomFilter3 = IntStream.rangeClosed(80, 200)
            .boxed()
            .collect(toBloomFilter(Funnels.integerFunnel(), 1000L, 0.001d));
    
    assertThat(bloomFilter1.isCompatible(bloomFilter2)).isTrue(); // same parameters
    assertThat(bloomFilter1.isCompatible(bloomFilter3)).isFalse(); // different parameters
    
    BloomFilter<Integer> bloomFilterUnion = bloomFilter1.copy(); // preserves BF parameters
    bloomFilterUnion.putAll(bloomFilter2);
    // note that bloomFilterUnion.putAll(bloomFilter3) would throw IAE
    
    // using AssertJ Predicate asserts https://joel-costigliola.github.io/assertj/core-8/api/org/assertj/core/api/PredicateAssert.html
    
    // or == union == putAll
    assertThat(bloomFilterUnion).accepts(90);
    assertThat(bloomFilter1.or(bloomFilter2)).accepts(90);
    assertThat(bloomFilter1.or(bloomFilter3)).accepts(90);
    assertThat(bloomFilterUnion).accepts(42);
    assertThat(bloomFilter1.or(bloomFilter2)).accepts(42);
    assertThat(bloomFilter1.or(bloomFilter3)).accepts(42);
    assertThat(bloomFilterUnion).accepts(180);
    assertThat(bloomFilter1.or(bloomFilter2)).accepts(180);
    assertThat(bloomFilter1.or(bloomFilter3)).accepts(180);
    assertThat(bloomFilterUnion).rejects(300);
    assertThat(bloomFilter1.or(bloomFilter2)).rejects(300);
    assertThat(bloomFilter1.or(bloomFilter3)).rejects(300);
    
    // and == intersection
    assertThat(bloomFilter1.and(bloomFilter2)).accepts(90);
    assertThat(bloomFilter1.and(bloomFilter3)).accepts(90);
    assertThat(bloomFilter1.and(bloomFilter2)).rejects(42);
    assertThat(bloomFilter1.and(bloomFilter3)).rejects(42);
    assertThat(bloomFilter1.and(bloomFilter2)).rejects(180);
    assertThat(bloomFilter1.and(bloomFilter3)).rejects(180);
    assertThat(bloomFilter1.and(bloomFilter2)).rejects(300);
    assertThat(bloomFilter1.and(bloomFilter3)).rejects(300);
    
    1. In case one is okay with reflection, can we safely perform biwise operations (or, and) on the 'data' field?

    I personally wouldn't mess with internals of any class, but here IMO there's no need to do so.