Tags: data-structures, hashset, bloom-filter

How can the accuracy of a Bloom filter be improved by introducing other data structures?


Suppose I have a very large dataset (for example, 1 billion keys), and I want to determine whether a particular key is in this dataset. A Bloom filter is a good fit here (O(k) lookup time and relatively manageable memory usage). However, one issue with a Bloom filter is the possibility of false positives: it may report that a key is present in the set when it is not. My question is whether it is possible to trade some extra lookup time and a relatively small amount of extra memory to guarantee that a membership check never returns a false positive.

For example, one rough approach I thought of is to store the dataset across multiple files on disk. When checking for a key, I could load each file block one at a time and check whether the key exists in it. Since no single file block would be too large, memory usage would remain manageable.
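Just to make this concrete, here is a rough sketch of the kind of two-level check I have in mind (the block layout, file names and the bloom object are only hypothetical): keep the Bloom filter in memory to reject most absent keys cheaply, and only fall back to the on-disk blocks when the filter answers "maybe".

    from pathlib import Path

    def definitely_contains(key, bloom, block_dir="key_blocks"):
        # bloom: an in-memory Bloom filter supporting "key in bloom" (hypothetical);
        # block_dir: the full key set split across smallish text files, one key per line.
        if key not in bloom:
            return False  # Bloom filters never give false negatives
        # The filter said "maybe": confirm against the on-disk blocks,
        # streaming one file at a time so memory stays small.
        for block in sorted(Path(block_dir).glob("*.txt")):
            with open(block) as fh:
                if any(line.rstrip("\n") == key for line in fh):
                    return True
        return False  # the Bloom filter result was a false positive

If each block covered a known key range or hash bucket, the fallback could jump straight to a single block instead of scanning them all.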


Solution

  • Just one option: you can potentially scale the Bloom filter itself to the point where the probability of a false positive is of the same order of magnitude as other background risks you already accept (e.g. your computer hardware or electricity supply failing, whether of its own accord or due to some accident, or you dying today from whatever cause).

    Using the probability formula from https://en.wikipedia.org/wiki/Bloom_filter

    pow(1 - pow(e, -k*n/m), k)
    

    k = number of hash functions used; Wikipedia says k = (m / n) ln 2 is optimal

    n = number of keys, m = number of bits in the Bloom filter

    In python:

    import math

    def f(n, m):
        # False positive probability for a Bloom filter holding n keys in m bits,
        # using the optimal number of hash functions k = (m / n) * ln 2.
        k = (m / n) * math.log(2)
        return (1 - math.e ** (-k * n / m)) ** k
    

    Your n value is "for example, 1 billion keys", and let's say we aim for a probability equivalent to a 40 year old dying today (from https://www.ssa.gov/oact/STATS/table4c6.html: 0.002066 per year, so about 5.66E-6 per day).

    A few invocations of f show us crossing that probability threshold:

    >>> f(1E9, 3E10)
    5.498664438639185e-07
    >>> f(1E9, 2E10)
    6.711786678855935e-05
    

    So, 3E10 bits is about 3.5GB of Bloom filter for your 1E9 keys, which is pretty reasonable for most modern work servers that might be expected to process that many keys and the associated data. The probability keeps dropping quickly as you add bits, though - e.g. f(1E9, 4E10) ~= 4.5E-9 - so you get roughly 1/100th of the risk for 1/3rd more memory.
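    To see the size / risk trade-off in one place, here is a small follow-on snippet that reuses the f defined above (the gb helper is just an illustrative name, and it treats 1 GB as 2**30 bytes to match the figures quoted here):

    def gb(bits):
        # bits -> gigabytes, treating 1 GB as 2**30 bytes as in the text above
        return bits / 8 / 2 ** 30

    for m in (2E10, 3E10, 4E10, 7E10):
        print(f"m = {m:.0E} bits ~ {gb(m):.2f} GB, false positive rate ~ {f(1E9, m):.2E}")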

    An alternative threshold - humanity being wiped out by an asteroid today: "Based on Moon crater history and NEA observations, the probability of a giant impact is between 0.03 and 0.3 for the next billion years." Taking the lower bound, 0.03 / 1E9 / 365 ~= 8.2E-14 per day. Because the false positive rate falls off exponentially as bits are added, we can get well under even that with f(1E9, 7E10) ~= 2.5E-15, still only using 8.15GB for the filter.
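    Going the other way - picking a target false positive probability first and then sizing the filter - works with the same formula: with the optimal k, the rate is (1/2)**k = (1/2)**((m/n) ln 2), which rearranges to m = -n * ln(p) / (ln 2)**2. A minimal sketch (required_bits is just an illustrative name):

    import math

    def required_bits(n, p):
        # Smallest m (bits) whose optimal-k false positive rate is at most p,
        # from rearranging p = (1/2)**k with k = (m / n) * ln 2.
        return -n * math.log(p) / math.log(2) ** 2

    For the asteroid threshold, required_bits(1E9, 8.2E-14) comes out at about 6.3E10 bits (roughly 7.3GB at 2**30 bytes per GB), so the 7E10 figure above leaves comfortable headroom.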