Search code examples
algorithmdelphimathdistributionhash-collision

Hash and reduce to bucket algorithm


The problem

We have a set of symbol sequences, which should be mapped to a pre-defined number of bucket-indexes.

Prerequisites

The symbol sequences are restricted in length (64 characters/bytes), and the hash algorithm used is the Delphi implementation of the Bob Jenkins hash for a 32bit hashvalue.

To further distribute the these hashvalues over a certain number of buckets we use the formula:

  • bucket_number := (hashvalue mod (num_buckets - 2)) + 2);
    (We don't want {0,1} to be in the result set)

The question

A colleague had some doubts, that we need to choose a prime number for num_buckets to achieve an optimal1 distribution in mapping the symbol sequences to the bucket_numbers.

The majority of the team believe that's more an unproven assumption, though our team mate just claimed that's mathematically intrinsic (without more in depth explanation).

I can imagine, that certain symbol sequence patterns we use (that's just a very limited subset of what's actually allowed) may prefer certain hashvalues, but generally I don't believe that's really significant for a large number of symbol sequences.
The hash algo should already distribute the hashvalues optimally, and I doubt that a prime number mod divisor would really make a significant difference (couldn't measure that empirically either), especially since Bob Jenkins hash calculus doesn't involve any prime numbers as well, as far I can see.

[TL;DR]
Does a prime number mod divisor matter for this case, or not?


1) optimal simply means a stable average value of number-of-sequences per bucket, which doesn't change (much) with the total number of sequences


Solution

  • Your colleague is simply wrong.

    If a hash works well, all hash values should be equally likely, with a relationship that is not obvious from the input data.

    When you take the hash mod some value, you are then mapping equally likely hash inputs to a reduced number of output buckets. The result is now not evenly distributed to the extent that outputs can be produced by different numbers of inputs. As long as the number of buckets is small relative to the range of hash values, this discrepancy is small. It is on the order of # of buckets / # of hash values. Since the number of buckets is typically under 10^6 and the number of hash values is more than 10^19, this is very small indeed. But if the number of buckets divides the range of hash values, there is no discrepancy.

    Primality doesn't enter into it except from the point that you get the best distribution when the number of buckets divides the range of the hash function. Since the range of the hash function is usually a power of 2, a prime number of buckets is unlikely to do anything for you.