Search code examples
language-agnosticbloom-filter

Combining Bloom Filters


I am using bloom filters to check for duplicated data in a set. However, there is a need to combine the results of two sets of data into a single filter to check for duplication across the two sets. I devised a function in pseudo-Python to perform this task:

def combine(a : bloom_filter, b : bloom_filter):
    assert a.length == b.length
    assert a.hashes == b.hashes

    c = new bloom_filter(length = a.length, hashes = b.hashes)
    c.attempts = a.attempts + b.attempts
    c.bits = a.bits | b.bits

    # Determining the amount of items
    a_and_b = count(a & b)
    a_not_b = count(a & !b)
    not_a_b = count(!a & b)
    neither = count(!a & !b)
    c.item_count = a_not_b / a.length * a.item_count
                 + not_a_b / b.length * b.item_count
                 + a_and_b / c.length * min(a.item_count, b.item_count)

    return c

Does this even sound correct? I am having considerable internal debate as to whether is is even possible to do what I intend, since much of the information about the source data is lost (which is the point of a bloom filter).


Solution

  • You can derive a formula for estimating the amount of items a Bloom Filter:

    c = log(z / N) / ((h * log(1 - 1 / N))
    
    N: Number of bits in the bit vector
    h: Number of hashes
    z: Number of zero bits in the bit vector
    

    This provides a fairly accurate estimate of the number of items in the Bloom Filter. You can come up with an estimate for contribution with simple subtraction.