Tags: java, algorithm, data-structures, probability, bloom-filter

Space-efficient probabilistic data structures for number retrieval


Suppose we have an algorithm that receives a hypothetically long stream of keys. As it processes each key, it generates a value between 0 and 1 for that key, to be retrieved later. The input set is large enough that we can't afford to store one value per key. The value-generating rule is independent across keys.

Now, assume that we can tolerate some error in the later lookup, but we still want to minimize the difference between the retrieved and original values (i.e., asymptotically, over many random retrievals).

For example, if the original value for a given key was 0.008, retrieving 0.06 is much better than retrieving 0.6.

What data structures or algorithms can we use to address this problem?

Bloom filters are the closest data structure I can think of. One could quantize the output range, use a Bloom filter for each bucket, and somehow combine their outputs at retrieval time to estimate the most likely value. Before I proceed down this path and reinvent the wheel: are there any known data structures, algorithms, or theoretical or practical approaches that address this problem?

I am ideally looking for a solution that can parameterize the tradeoff between space and error rates.
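
For concreteness, the bucketed idea above could look roughly like the sketch below. The bucket count, filter size, hash scheme, and the averaging rule used to combine matching buckets are all illustrative guesses, not a worked-out design:

    import java.util.BitSet;

    // Rough sketch: quantize [0, 1] into buckets, keep one Bloom filter per
    // bucket, and combine matching buckets at query time. All parameters
    // here are illustrative.
    public class BucketedBloom {
        private static final int BITS = 1 << 14;   // bits per filter (illustrative)
        private static final int HASHES = 3;       // hash functions per filter
        private final BitSet[] filters;

        public BucketedBloom(int buckets) {
            filters = new BitSet[buckets];
            for (int b = 0; b < buckets; b++) filters[b] = new BitSet(BITS);
        }

        // Derive the i-th bit position for a key; a seeded re-hash stands in
        // for independent hash functions.
        private static int bit(String key, int i) {
            return Math.floorMod((key + "#" + i).hashCode(), BITS);
        }

        public void put(String key, double value) {
            int b = Math.min((int) (value * filters.length), filters.length - 1);
            for (int i = 0; i < HASHES; i++) filters[b].set(bit(key, i));
        }

        // Average the centres of all buckets whose filter claims the key;
        // false positives pull the estimate toward unrelated buckets.
        public double get(String key) {
            double sum = 0;
            int hits = 0;
            for (int b = 0; b < filters.length; b++) {
                boolean hit = true;
                for (int i = 0; i < HASHES; i++) {
                    if (!filters[b].get(bit(key, i))) { hit = false; break; }
                }
                if (hit) { sum += (b + 0.5) / filters.length; hits++; }
            }
            return hits == 0 ? Double.NaN : sum / hits;
        }

        public static void main(String[] args) {
            BucketedBloom bb = new BucketedBloom(16);
            bb.put("key-42", 0.008);
            System.out.println(bb.get("key-42")); // ~0.03125, centre of bucket 0
        }
    }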


Solution

  • Perhaps a variant of the Bloom filter called a compact approximator: like a Bloom filter, but generalized so that the entries are values from a lattice. Here the lattice is simply the floats between 0 and 1 (they have more structure than a plain lattice, but they satisfy the requirements), or however you're storing those numbers.

    An update replaces each relevant entry with the max of its current value and the value being remembered; a query takes the minimum of all its relevant entries (examples below). The result can only overestimate the true value. By reversing the ordering (swapping min and max, and initializing the entries to 1 instead of 0) you get an underestimate instead; together the two give an interval that contains the true value.


    So for example, using the first approximator (the overestimating one), putting in a number looks like this:

    index1 = hash1(key)
    data[index1] = max(data[index1], value)
    index2 = hash2(key)
    data[index2] = max(data[index2], value)
    ... etc
    

    And getting the overestimation looks like:

    result = 1
    index1 = hash1(key)
    result = min(data[index1], result)
    index2 = hash2(key)
    result = min(data[index2], result)
    ... etc
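
    For completeness, here is a minimal, self-contained Java sketch of the scheme, maintaining both directions at once so that a query returns an interval. The class name, table size, hash count, and string-keyed hashing are illustrative choices; tune size and numHashes to trade space against error.

    import java.util.Arrays;

    public class CompactApproximator {
        private final double[] upper;   // updated with max, queried with min -> overestimate
        private final double[] lower;   // updated with min, queried with max -> underestimate
        private final int numHashes;

        public CompactApproximator(int size, int numHashes) {
            this.upper = new double[size];   // starts at 0.0, the lattice bottom
            this.lower = new double[size];
            this.numHashes = numHashes;
            Arrays.fill(lower, 1.0);         // starts at 1.0, the lattice top
        }

        // Derive the i-th index for a key; a simple seeded re-hash stands in
        // for k independent hash functions.
        private int index(String key, int i) {
            return Math.floorMod((key + "#" + i).hashCode(), upper.length);
        }

        public void put(String key, double value) {
            for (int i = 0; i < numHashes; i++) {
                int idx = index(key, i);
                upper[idx] = Math.max(upper[idx], value);
                lower[idx] = Math.min(lower[idx], value);
            }
        }

        // Returns {low, high}; for inserted keys the true value lies within.
        // Collisions can only widen the interval, never exclude the truth.
        public double[] get(String key) {
            double high = 1.0, low = 0.0;
            for (int i = 0; i < numHashes; i++) {
                int idx = index(key, i);
                high = Math.min(high, upper[idx]);
                low = Math.max(low, lower[idx]);
            }
            return new double[] { low, high };
        }

        public static void main(String[] args) {
            CompactApproximator ca = new CompactApproximator(1 << 16, 3);
            ca.put("key-42", 0.008);
            double[] bounds = ca.get("key-42");
            System.out.printf("true 0.008, interval [%.4f, %.4f]%n", bounds[0], bounds[1]);
        }
    }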