
Why Bloom filters use the same array for all k hashing algorithms


I understand that in order to reduce the chance of a single hash collision resulting in a false positive, Bloom filters use multiple (k) hash functions.

Wouldn't it be more beneficial to use k arrays, one for each hashing algorithm? That way, if by coincidence many input keys are mapped by algorithm A to the same value and stored in the same array cell, and then another key is mapped by algorithm B to the same value, this is valuable information that should be marked separately. I think that k arrays of size m/k should give a better result than a single array of size m. Am I wrong?


Solution

  • Contrary to your intuition (and mine!), your proposed data structure will actually have a slightly worse false positive rate on lookups than a Bloom filter (assuming perfect hash functions), because you've eliminated the possibility of self-collision - that is, you've made it so that two hash functions run on the same element never return the same array index. That may sound like a good thing, but actually, it's bad, because it means that on average more bits will be set in the data structure you propose (henceforth, the "Akiva filter") than would be set in a Bloom filter of the same size, number of hash functions, and number of elements.
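
    To make the two designs concrete, here is a minimal Python sketch of both structures (the class names, and the PRNG-seeding trick used to stand in for k independent perfect hash functions, are mine, purely for illustration):

    ```python
    import random


    class BloomFilter:
        """Classic Bloom filter: all k hash functions index one shared m-bit array."""

        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = [False] * m

        def _indexes(self, item):
            # Simulate k independent hash functions by seeding a PRNG with the
            # item's hash (hash() is stable within one process, which suffices here).
            rng = random.Random(hash(item))
            return [rng.randrange(self.m) for _ in range(self.k)]

        def add(self, item):
            for i in self._indexes(item):
                self.bits[i] = True

        def might_contain(self, item):
            return all(self.bits[i] for i in self._indexes(item))


    class AkivaFilter:
        """Proposed variant: k sub-arrays of m/k bits, one per hash function."""

        def __init__(self, m, k):
            assert m % k == 0, "m must be divisible by k"
            self.m, self.k = m, k
            self.subarrays = [[False] * (m // k) for _ in range(k)]

        def _indexes(self, item):
            # One index per sub-array; self-collision is impossible by construction.
            rng = random.Random(hash(item))
            return [rng.randrange(self.m // self.k) for _ in range(self.k)]

        def add(self, item):
            for sub, i in zip(self.subarrays, self._indexes(item)):
                sub[i] = True

        def might_contain(self, item):
            return all(sub[i] for sub, i in zip(self.subarrays, self._indexes(item)))
    ```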

    A simple example

    To see starkly how eliminating self-collision can be bad, let's start by considering the extreme (and unrealistic) case where:

    • k=2 (we have only two hash functions, h₁ and h₂),
    • m=4 (our filter is only 4 bits large), and
    • we've added only one element, x, to our filter.

    Suppose we then want to check whether some other value, y, is a member of the filter or not.

    If our filter is a Bloom filter, there's a 25% chance that we got a lucky hash collision when we added x - i.e. that h₁(x) = h₂(x) and so the filter has only 1 bit set. If that's the case, then our chance of a false positive when we look up y is low - just 0.25 * 0.25 = 0.0625. On the other hand, if we got unlucky and h₁(x) ≠ h₂(x) (i.e. two bits in our Bloom filter are set), then our chance of a false positive when we look up y is 0.5 * 0.5 = 0.25. So our overall chance of a false positive is 0.25 * 0.0625 + 0.75 * 0.25 = 0.203125.

    On the other hand, if our filter is an Akiva filter, split into two arrays of size m/k=2, then there is a 100% chance that after adding element x there are two bits set - one in the first sub-array and one in the second. So our chance of a false positive when we look up y is exactly 25%.
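
    Both numbers are easy to verify empirically. Here is a short Monte Carlo sketch under the same perfect-hash assumption, drawing the hash values uniformly at random instead of hashing real keys:

    ```python
    import random

    def trial_bloom():
        # Add x to a 4-bit Bloom filter: h1(x) and h2(x) may self-collide.
        bits = {random.randrange(4), random.randrange(4)}
        # False positive iff both of y's hash indexes are already set.
        return random.randrange(4) in bits and random.randrange(4) in bits

    def trial_akiva():
        # Add x to two sub-arrays of 2 bits each: exactly one bit set in each.
        sub1, sub2 = random.randrange(2), random.randrange(2)
        return random.randrange(2) == sub1 and random.randrange(2) == sub2

    N = 1_000_000
    print(sum(trial_bloom() for _ in range(N)) / N)  # ~0.2031
    print(sum(trial_akiva() for _ in range(N)) / N)  # ~0.25
    ```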

    A general proof

    Can we demonstrate that it is true in general that an Akiva filter has a worse false positive rate than a Bloom filter, regardless of the values of m (the number of bits in the data structure), k (the number of hash functions), and n (the number of elements that have been added to the filter)? Yes; to do so, let's first determine the probability of a false positive as an expression of m, k, and n for each data structure.

    Let us start with the Akiva filter. We look up k sub-array indexes for the element in question, one in each sub-array of size m/k. Each of the n elements already added has set exactly one bit in each sub-array, so a given bit is still unset with probability (1 - k/m)ⁿ; each of the k bits we check therefore has a (completely independent) 1 - (1 - k/m)ⁿ chance of already being set, and our overall chance of a false positive is (1 - (1 - k/m)ⁿ)ᵏ.
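
    As a sanity check, plugging the numbers from the simple example above into this formula (m=4, k=2, n=1) gives (1 - (1 - 2/4)¹)² = (1/2)² = 0.25, matching the 25% false positive chance we computed directly.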

    Next, we consider the Bloom filter. Here we can defer to some published scholarship:

    In the analysis of Bloom filter false positive rate it is typically assumed that the hash functions are perfect whereby they produce an independent and random index value for each object and thus the false positive rate is only a function of m, n, and k. The “classic” analysis of Bloom filter false positive rate is as follows. ... The probability that an arbitrary bit is not set after k bit insertions from the mapping of one object is (1 − 1/m)ᵏ. For n objects mapped-in, the probability that an arbitrary bit is not set is (1 − 1/m)ᵏⁿ and therefore the probability that an arbitrary bit is set is pset = 1 - (1 − 1/m)ᵏⁿ. Thus, for k hashes for a test object not mapped in the Bloom filter, the probability of false positive is (we call this the “classic formula”),

    pfalse = psetᵏ = (1 - (1 − 1/m)ᵏⁿ)ᵏ
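
    To make the formula concrete, here is a quick evaluation for one arbitrary parameter choice (m = 1024 bits, k = 4 hashes, n = 100 elements; the numbers are mine, purely for illustration):

    ```python
    m, k, n = 1024, 4, 100
    p_set = 1 - (1 - 1 / m) ** (k * n)   # probability an arbitrary bit is set
    p_false = p_set ** k                 # classic false positive formula
    print(p_false)                       # ~0.011
    ```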

    So we have a false positive probability of (1 - (1 - k/m)ⁿ)ᵏ for the Akiva filter and (1 - (1 − 1/m)ᵏⁿ)ᵏ for the Bloom filter; but when is the former larger? The answer is that the Akiva filter's false positive rate is strictly larger (worse) for k ≥ 2, regardless of the values of m or n. To prove this, we first observe that the inequality we wish to prove, (1 - (1 - k/m)ⁿ)ᵏ > (1 - (1 − 1/m)ᵏⁿ)ᵏ, is equivalent to a much simpler inequality, by performing the following transformations:

    • (1 - (1 - k/m)ⁿ)ᵏ > (1 - (1 − 1/m)ᵏⁿ)ᵏ, iff
    • 1 - (1 - k/m)ⁿ > 1 - (1 − 1/m)ᵏⁿ (by taking kth roots), iff
    • (1 - k/m)ⁿ < (1 − 1/m)ᵏⁿ (by subtracting 1 from each side, then flipping the sign of both sides), iff
    • 1 - k/m < (1 − 1/m)ᵏ (by taking nth roots)

    Finally we can easily prove that 1 - k/m < (1 − 1/m)ᵏ for all k ≥ 2 by first observing that this is true for k=2:

    1 - 2/m < 1 - 2/m + 1/m² = (1 − 1/m)²

    ... and then proceeding by induction. Suppose that 1 - (k-1)/m < (1 − 1/m)ᵏ⁻¹. Then we can express (1 − 1/m)ᵏ⁻¹ as (1 - (k-1)/m) + ε, for some ε > 0. Then

    (1 − 1/m)ᵏ
    = (1 - 1/m)(1 - (k-1)/m + ε)
    = 1 - (k-1)/m + ε - 1/m + (k-1)/m² - ε/m
    = 1 - k/m + (1 - 1/m)ε + (k-1)/m²
    > 1 - k/m

    QED.
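
    Both the reduced inequality and the original closed-form comparison are also easy to spot-check numerically; here is a short sketch over an arbitrary parameter grid (the grid values are mine):

    ```python
    def p_false_akiva(m, k, n):
        return (1 - (1 - k / m) ** n) ** k

    def p_false_bloom(m, k, n):
        return (1 - (1 - 1 / m) ** (k * n)) ** k

    for m in (64, 1024, 65536):
        for k in (2, 4, 8):
            assert 1 - k / m < (1 - 1 / m) ** k  # the reduced inequality
            for n in (1, 10, m // (2 * k)):
                assert p_false_akiva(m, k, n) > p_false_bloom(m, k, n)
    print("Akiva filter is strictly worse on every case tried.")
    ```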

    In conclusion

    Burton Howard Bloom seems to have known what he was doing; the Bloom filter has a lower false positive rate than the alternative structure you contemplate with k distinct arrays, and is therefore superior.