Tags: hashmap, statistics, bloom-filter

One 128-bit hash vs. two different 64-bit hashes (non-crypto)?


I'm currently implementing a hashtable over a dataset of about 100 billion items. Most of them are duplicates (about 75%), so the set of "unique" values is considerably smaller.

I know that I cannot avoid collisions completely, but I want to make them at least very unlikely. The idea was to test against two different hash functions, on the assumption that if one hash collides, the other one probably will not. See: Bloom filter.

My question now is: isn't that statistically the same as just using a single hash of twice the size? Say, Murmur3 128-bit instead of Murmur3 64-bit + CityHash 64-bit?
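To make the comparison concrete, here is a minimal Python sketch of the two options. hashlib.blake2b stands in for Murmur3/CityHash purely to keep the example dependency-free (an assumption, any well-mixed 128-bit hash would do):

```python
import hashlib

def hash_128(data: bytes) -> int:
    # Stand-in for Murmur3 128: one well-mixed 128-bit digest.
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "big")

def as_two_64bit_hashes(data: bytes) -> tuple[int, int]:
    # The two halves of a good 128-bit hash behave like two
    # independent 64-bit hash functions over the same input.
    h = hash_128(data)
    return h >> 64, h & ((1 << 64) - 1)

h1, h2 = as_two_64bit_hashes(b"example item")
print(hex(h1), hex(h2))
```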


Solution

  • If they're supremely excellent hash functions, the collision probability should be the same either way. In practice, I suspect the separate hash functions will perform a little better.

    A Bloom filter is a clever way to save memory by bitwise OR-ing the hash sets together, trading off some collision probability. In theory one could do the same job with two 64-bit hashes vs. the two halves of a 128-bit hash. You probably don't have enough RAM for 2^128 bits, so it'd be practical to split it into (or use separate) four 32-bit hashes and overlay them into a Bloom filter containing 2^32 bits = 2^29 bytes = 1/2 GB. (A sketch of this layout follows after this answer.)

    With an excellent 64-bit hash function [I'm avoiding the term "perfect hash function" since it has a specific meaning], the probability of two entries accidentally colliding is 2^-64, which is an exceedingly small number.

    If you had 100G unique items, you'd need 100G^2 = 10^22, or about 2^73, possible hash values (that is, 73 hash bits) to get the probability of a collision down to roughly 1/2. (A quick check of these numbers follows below.)
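As a quick check on those figures, here is a small Python calculation. The 10^22 and 2^73 numbers come from the answer above; the exp(-n^2 / 2m) approximation is the standard birthday-bound estimate, an assumption not spelled out in the original:

```python
import math

n = 100e9            # ~100 billion unique items (from the question)
m = n * n            # birthday bound: ~n^2 hash values for ~even odds
bits = math.log2(m)  # hash bits needed

# P(no collision) ~ exp(-n^2 / (2m)), the standard birthday approximation
p_no_collision = math.exp(-n**2 / (2 * m))

print(f"hash values needed: {m:.3g} (~2^{bits:.0f})")        # ~1e22, ~2^73
print(f"P(no collision) at that size: {p_no_collision:.2f}")  # ~0.61
```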
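And here is the Bloom-filter layout from the answer as a minimal Python sketch: four 32-bit hash values overlaid into one 2^32-bit array. hashlib.blake2b again stands in for the non-crypto hashes (an assumption for the sake of a self-contained example); the four 32-bit slices of its 128-bit digest play the role of four independent hash functions:

```python
import hashlib

NUM_BITS = 2 ** 32  # 2^32 bits = 2^29 bytes = 1/2 GB, as in the answer

class BloomFilter:
    def __init__(self) -> None:
        self.bits = bytearray(NUM_BITS // 8)  # note: allocates 512 MB

    def _indexes(self, item: bytes):
        # One 128-bit digest, split into four 32-bit bit positions;
        # blake2b is a stand-in for Murmur3-128 / two 64-bit hashes.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        for i in range(4):
            yield int.from_bytes(digest[4 * i : 4 * i + 4], "big")

    def add(self, item: bytes) -> None:
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item: bytes) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))

bf = BloomFilter()
bf.add(b"some item")
print(bf.might_contain(b"some item"))   # True
print(bf.might_contain(b"other item"))  # almost certainly False
```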