I am using Bloom filters to check for duplicate data in a set. However, I now need to combine the filters for two data sets into a single filter, so that I can check for duplication across both sets. I devised a function in pseudo-Python to perform this task:
def combine(a: bloom_filter, b: bloom_filter) -> bloom_filter:
    # The filters must share the same bit length and hash functions,
    # otherwise ORing their bit vectors is meaningless.
    assert a.length == b.length
    assert a.hashes == b.hashes
    c = bloom_filter(length=a.length, hashes=a.hashes)
    c.attempts = a.attempts + b.attempts
    # The union of the two filters is the bitwise OR of their bits.
    c.bits = a.bits | b.bits
    # Determining the amount of items from the bit overlap.
    # A Python int has no fixed width, so take the complement by
    # XORing with a mask of the filter's width instead of using ~.
    mask = (1 << a.length) - 1
    a_and_b = count(a.bits & b.bits)
    a_not_b = count(a.bits & (b.bits ^ mask))
    not_a_b = count((a.bits ^ mask) & b.bits)
    neither = count((a.bits ^ mask) & (b.bits ^ mask))  # not used below
    c.item_count = (a_not_b / a.length * a.item_count
                    + not_a_b / b.length * b.item_count
                    + a_and_b / c.length * min(a.item_count, b.item_count))
    return c
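For reference, here is one possible minimal bloom_filter class and count helper that the function above assumes; the field names match the ones I use, but the double-hashing scheme and digest choices are just an illustrative sketch, not part of my actual design:

import hashlib

def count(x):
    # Population count: number of set bits in an integer bit vector.
    return bin(x).count("1")

class bloom_filter:
    def __init__(self, length, hashes):
        self.length = length      # number of bits in the bit vector
        self.hashes = hashes      # number of hash functions
        self.bits = 0             # bit vector, stored as a Python int
        self.item_count = 0       # estimated number of distinct items
        self.attempts = 0         # total number of insertion attempts

    def _positions(self, item):
        # Derive the hash positions by double hashing two digests
        # (Kirsch-Mitzenmacher); any h independent hashes would do.
        data = repr(item).encode()
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        return [(h1 + i * h2) % self.length for i in range(self.hashes)]

    def add(self, item):
        self.attempts += 1
        mask = 0
        for pos in self._positions(item):
            mask |= 1 << pos
        if self.bits & mask != mask:  # at least one bit newly set
            self.item_count += 1
        self.bits |= mask

    def __contains__(self, item):
        mask = 0
        for pos in self._positions(item):
            mask |= 1 << pos
        return self.bits & mask == mask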
Does this even sound correct? I am having considerable internal debate as to whether it is even possible to do what I intend, since much of the information about the source data is lost (which is the point of a Bloom filter).
You can derive a formula for estimating the number of items in a Bloom filter:
c = log(z / N) / (h * log(1 - 1 / N))
c: Estimated number of items in the filter
N: Number of bits in the bit vector
h: Number of hashes
z: Number of zero bits in the bit vector
This provides a fairly accurate estimate of the number of items in the Bloom filter. To estimate each set's contribution to a combined filter, apply the formula to a, to b, and to the union a | b; the overlap then falls out by simple subtraction: estimate(a) + estimate(b) - estimate(a | b).
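A minimal sketch of that estimator in Python, assuming the integer bit vector and field names from the question's code (estimate_item_count is just an illustrative name; the formula is the one above):

import math

def estimate_item_count(bits, length, hashes):
    # c = log(z / N) / (h * log(1 - 1 / N))
    N, h = length, hashes
    z = N - bin(bits).count("1")   # z: number of zero bits
    if z == 0:
        return float("inf")        # saturated filter: estimate diverges
    return math.log(z / N) / (h * math.log(1 - 1 / N))

# Apply it to a, to b, and to their union, then subtract:
est_a = estimate_item_count(a.bits, a.length, a.hashes)
est_b = estimate_item_count(b.bits, b.length, b.hashes)
est_union = estimate_item_count(a.bits | b.bits, a.length, a.hashes)
est_overlap = est_a + est_b - est_union   # items the two sets share

Setting c.item_count = round(est_union) in your combine function is also likely more robust than weighting by raw bit counts, since the union's bit vector already contains everything the formula needs.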