Tags: hashmap, statistics, bloom-filter

One 128-bit hash vs. two different 64-bit hashes (non-crypto)?


I'm currently implementing a hashtable over a dataset of about 100 billion items. Most of them are duplicates (about 75%), so the set of "unique" values is considerably smaller.

I know that I cannot avoid collisions completely, but I want to make them at least very unlikely. The idea was to test against two different hash functions, on the assumption that if one hash collides, the other one probably will not. See: Bloom filter.

My question now is: isn't that statistically the same as just using a single hash of twice the size? Say, Murmur3 128-bit instead of Murmur3 64-bit + CityHash 64-bit?
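To make the comparison concrete, here is a minimal Python sketch of the two options. hashlib.blake2b stands in for Murmur3/CityHash purely to keep the example dependency-free (an assumption, any well-mixed 128-bit hash would do):

```python
import hashlib

def hash_128(data: bytes) -> int:
    # Stand-in for Murmur3 128: one well-mixed 128-bit digest.
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "big")

def as_two_64bit_hashes(data: bytes) -> tuple[int, int]:
    # The two halves of a good 128-bit hash behave like two
    # independent 64-bit hash functions over the same input.
    h = hash_128(data)
    return h >> 64, h & ((1 << 64) - 1)

h1, h2 = as_two_64bit_hashes(b"example item")
print(hex(h1), hex(h2))
```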


Solution

  • If they're supremely excellent hash functions, the collision probability should be the same either way. In practice, I suspect the separate hash functions will perform a little better.

    A Bloom filter is a clever way to save memory by bitwise OR-ing the hash sets together, trading off some collision probability. In theory one could do the same job with two 64-bit hashes vs. the two halves of a 128-bit hash. You probably don't have enough RAM for 2^128 bits, so it'd be practical to split it into (or use separate) four 32-bit hashes and overlay them into a Bloom filter containing 2^32 bits = 2^29 bytes = 1/2 GB. (A sketch of this layout follows after this answer.)

    With an excellent 64-bit hash function [I'm avoiding the term "perfect hash function" since it has a specific meaning], the probability of two entries accidentally colliding is 2^-64, which is an exceedingly small number.

    If you had 100G unique items, you'd need 100G^2 = 10^22, or about 2^73, possible hash values (that is, 73 hash bits) to get the probability of a collision down to roughly 1/2. (A quick check of these numbers follows below.)
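As a quick check on those figures, here is a small Python calculation. The 10^22 and 2^73 numbers come from the answer above; the exp(-n^2 / 2m) approximation is the standard birthday-bound estimate, an assumption not spelled out in the original:

```python
import math

n = 100e9            # ~100 billion unique items (from the question)
m = n * n            # birthday bound: ~n^2 hash values for ~even odds
bits = math.log2(m)  # hash bits needed

# P(no collision) ~ exp(-n^2 / (2m)), the standard birthday approximation
p_no_collision = math.exp(-n**2 / (2 * m))

print(f"hash values needed: {m:.3g} (~2^{bits:.0f})")        # ~1e22, ~2^73
print(f"P(no collision) at that size: {p_no_collision:.2f}")  # ~0.61
```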
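And here is the Bloom-filter layout from the answer as a minimal Python sketch: four 32-bit hash values overlaid into one 2^32-bit array. hashlib.blake2b again stands in for the non-crypto hashes (an assumption for the sake of a self-contained example); the four 32-bit slices of its 128-bit digest play the role of four independent hash functions:

```python
import hashlib

NUM_BITS = 2 ** 32  # 2^32 bits = 2^29 bytes = 1/2 GB, as in the answer

class BloomFilter:
    def __init__(self) -> None:
        self.bits = bytearray(NUM_BITS // 8)  # note: allocates 512 MB

    def _indexes(self, item: bytes):
        # One 128-bit digest, split into four 32-bit bit positions;
        # blake2b is a stand-in for Murmur3-128 / two 64-bit hashes.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        for i in range(4):
            yield int.from_bytes(digest[4 * i : 4 * i + 4], "big")

    def add(self, item: bytes) -> None:
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item: bytes) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))

bf = BloomFilter()
bf.add(b"some item")
print(bf.might_contain(b"some item"))   # True
print(bf.might_contain(b"other item"))  # almost certainly False
```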