We have a set of symbol sequences, which should be mapped to a pre-defined number of bucket-indexes.
The symbol sequences are restricted in length (64 characters/bytes), and the hash algorithm used is the Delphi implementation of the Bob Jenkins hash for a 32bit hashvalue.
To further distribute the these hashvalues over a certain number of buckets we use the formula:
bucket_number := (hashvalue mod (num_buckets - 2)) + 2);
A colleague had some doubts, that we need to choose a prime number for num_buckets
to achieve an optimal1 distribution in mapping the symbol sequences to the bucket_number
s.
The majority of the team believe that's more an unproven assumption, though our team mate just claimed that's mathematically intrinsic (without more in depth explanation).
I can imagine, that certain symbol sequence patterns we use (that's just a very limited subset of what's actually allowed) may prefer certain hashvalue
s, but generally I don't believe that's really significant for a large number of symbol sequences.
The hash algo should already distribute the hashvalue
s optimally, and I doubt that a prime number mod divisor would really make a significant difference (couldn't measure that empirically either), especially since Bob Jenkins hash calculus doesn't involve any prime numbers as well, as far I can see.
[TL;DR]
Does a prime number mod divisor matter for this case, or not?
1) optimal simply means a stable average value of number-of-sequences per bucket, which doesn't change (much) with the total number of sequences
Your colleague is simply wrong.
If a hash works well, all hash values should be equally likely, with a relationship that is not obvious from the input data.
When you take the hash mod some value, you are then mapping equally likely hash inputs to a reduced number of output buckets. The result is now not evenly distributed to the extent that outputs can be produced by different numbers of inputs. As long as the number of buckets is small relative to the range of hash values, this discrepancy is small. It is on the order of # of buckets / # of hash values. Since the number of buckets is typically under 10^6 and the number of hash values is more than 10^19, this is very small indeed. But if the number of buckets divides the range of hash values, there is no discrepancy.
Primality doesn't enter into it except from the point that you get the best distribution when the number of buckets divides the range of the hash function. Since the range of the hash function is usually a power of 2, a prime number of buckets is unlikely to do anything for you.