Bloom filters seem promising for a real-world problem I'm working on. A popular java implementation seems to be Google's guava library.
I need to use bloom filter's union operation, and an intersection operation would be good to have, but I can work around it.
In the java docs, I could find no method which performs intersection, while the putAll method seems to work like the union operation.
So my questions are these:
If someone could recommend any another popular and well-tested library which is available on maven's repository,and has intersection and union, that'd also solve my needs.
- Is putAll() the correct way to get the union of two bloom filters?
Yes, BloomFilter#putAll(BloomFilter)
JavaDoc says:
Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data. The mutations happen to this instance. Callers must ensure the Bloom filters are appropriately sized to avoid saturating them.
The catch is that it'll throw IllegalArgumentException - if isCompatible(that) == false
, so if you want to use this method you must make sure that isCompatible
returns true:
Determines whether a given Bloom filter is compatible with this Bloom filter. For two Bloom filters to be compatible, they must:
- not be the same instance
- have the same number of hash functions
- have the same bit size
- have the same strategy
- have equal funnels
That said, let's try to answer the second question and find an easier solution.
- Is there a non-reflective way to get the intersection of two or more bloom filters?
Yes, you can use the fact that BloomFilter
s implement Predicate
and use its .and()
and .or()
methods, remembering that mightContain
behavior depends on BloomFilter's parameters.
Example:
// bloomFilter1: [0, 100], expected size 200, FPP 1%
BloomFilter<Integer> bloomFilter1 = IntStream.rangeClosed(0, 100)
.boxed()
.collect(toBloomFilter(Funnels.integerFunnel(), 200L, 0.01d));
// bloomFilter2: [80, 200], expected size 200, FPP 1%
BloomFilter<Integer> bloomFilter2 = IntStream.rangeClosed(80, 200)
.boxed()
.collect(toBloomFilter(Funnels.integerFunnel(), 200L, 0.01d));
// bloomFilter2: [80, 200], expected size 1000, FPP 0.1%
BloomFilter<Integer> bloomFilter3 = IntStream.rangeClosed(80, 200)
.boxed()
.collect(toBloomFilter(Funnels.integerFunnel(), 1000L, 0.001d));
assertThat(bloomFilter1.isCompatible(bloomFilter2)).isTrue(); // same parameters
assertThat(bloomFilter1.isCompatible(bloomFilter3)).isFalse(); // different parameters
BloomFilter<Integer> bloomFilterUnion = bloomFilter1.copy(); // preserves BF parameters
bloomFilterUnion.putAll(bloomFilter2);
// note that bloomFilterUnion.putAll(bloomFilter3) would throw IAE
// using AssertJ Predicate asserts https://joel-costigliola.github.io/assertj/core-8/api/org/assertj/core/api/PredicateAssert.html
// or == union == putAll
assertThat(bloomFilterUnion).accepts(90);
assertThat(bloomFilter1.or(bloomFilter2)).accepts(90);
assertThat(bloomFilter1.or(bloomFilter3)).accepts(90);
assertThat(bloomFilterUnion).accepts(42);
assertThat(bloomFilter1.or(bloomFilter2)).accepts(42);
assertThat(bloomFilter1.or(bloomFilter3)).accepts(42);
assertThat(bloomFilterUnion).accepts(180);
assertThat(bloomFilter1.or(bloomFilter2)).accepts(180);
assertThat(bloomFilter1.or(bloomFilter3)).accepts(180);
assertThat(bloomFilterUnion).rejects(300);
assertThat(bloomFilter1.or(bloomFilter2)).rejects(300);
assertThat(bloomFilter1.or(bloomFilter3)).rejects(300);
// and == intersection
assertThat(bloomFilter1.and(bloomFilter2)).accepts(90);
assertThat(bloomFilter1.and(bloomFilter3)).accepts(90);
assertThat(bloomFilter1.and(bloomFilter2)).rejects(42);
assertThat(bloomFilter1.and(bloomFilter3)).rejects(42);
assertThat(bloomFilter1.and(bloomFilter2)).rejects(180);
assertThat(bloomFilter1.and(bloomFilter3)).rejects(180);
assertThat(bloomFilter1.and(bloomFilter2)).rejects(300);
assertThat(bloomFilter1.and(bloomFilter3)).rejects(300);
- In case one is okay with reflection, can we safely perform biwise operations (or, and) on the 'data' field?
I personally wouldn't mess with internals of any class, but here IMO there's no need to do so.