I was browsing the source code of .Net's Dictionary and HashSet, which are both basically implemented as hash tables.
They always select prime numbers for the internal size of their tables. I assume this is because the hash values these collections work with are generally poorly randomized (for instance, the hash code of an integer is simply the integer itself), and prime sizes help reduce collisions when storing keys that follow a pattern. (This is not the main topic, but feel free to drop a comment if you feel it's necessary; I'm just mentioning it here in case it's important for the main question.)
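To make the "pattern" point concrete, here's a toy example of my own (not from the .Net source): keys that are all multiples of 32 collide into a single bucket when the table size is a power of two, but spread out when the size is a prime.

```csharp
using System;
using System.Linq;

class PatternedKeysDemo
{
    static void Main()
    {
        // Keys following a pattern: multiples of 32.
        // Note that int.GetHashCode() simply returns the value itself.
        int[] keys = Enumerable.Range(0, 20).Select(i => i * 32).ToArray();

        // Power-of-two table size: every key lands in bucket 0.
        Console.WriteLine(string.Join(", ", keys.Select(k => k.GetHashCode() % 32)));

        // Prime table size: the same keys spread across distinct buckets.
        Console.WriteLine(string.Join(", ", keys.Select(k => k.GetHashCode() % 31)));
    }
}
```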
The code which selects the primes can be found on GitHub. I was surprised to find out that their precomputed list of primes avoids 101 and all subsequent primes of the form 101*N + 1. In other words, the table doesn't list 101, 607, 809, 1213, etc.
Even when going past the precomputed table and computing a new prime, before accepting it they specifically check that (i - 1) % HashPrime != 0 (where HashPrime is 101).
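For reference, the selection logic is roughly this (my paraphrase of HashHelpers from memory, so the details may not match the actual source exactly):

```csharp
using System;

static class PrimeSelectionSketch
{
    // Abbreviated stand-in for the real precomputed table.
    private static readonly int[] Primes = { 3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107 /* ... */ };

    private const int HashPrime = 101;

    private static bool IsPrime(int candidate)
    {
        if ((candidate & 1) == 0)
            return candidate == 2;
        int limit = (int)Math.Sqrt(candidate);
        for (int divisor = 3; divisor <= limit; divisor += 2)
            if (candidate % divisor == 0)
                return false;
        return true;
    }

    public static int GetPrime(int min)
    {
        foreach (int prime in Primes)
            if (prime >= min)
                return prime;

        // Past the precomputed table: take the next odd prime, but reject any
        // prime p where (p - 1) % HashPrime == 0, i.e. p = 101*N + 1.
        for (int i = min | 1; i < int.MaxValue; i += 2)
            if (IsPrime(i) && (i - 1) % HashPrime != 0)
                return i;

        return min;
    }
}
```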
My question is: why? What's so bad about that specific family of primes? I'm looking for examples of issues these numbers can cause.
As @GuruStron commented, Hashtable.InitHash sheds some light. Hashtable is an older, non-generic hash table implementation back from .Net Framework 1.0 which I assume nobody sane uses anymore.
It used a form of double hashing. True double hashing would use two different hash functions derived from the key object; the second hash is used to calculate the step when looking for alternative buckets in case of collisions, which means that even keys colliding on the first hash should still probe different alternative buckets. Unfortunately, .Net objects normally expose only one hash function (GetHashCode), so Hashtable doesn't use true double hashing and calculates the second hash from the first one instead. The function used was basically just SecondHash = FirstHash * SomePrime + 1.
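In code, my simplified understanding of the probing scheme (not the exact InitHash code) would be:

```csharp
using System.Collections.Generic;

static class DoubleHashingSketch
{
    private const int HashPrime = 101;

    // The sequence of buckets probed for a given first hash. This is my
    // simplification of the scheme described above, not the real implementation.
    public static IEnumerable<int> ProbeSequence(int firstHash, int tableSize)
    {
        int step = (firstHash * HashPrime + 1) % tableSize; // SecondHash = FirstHash * 101 + 1
        int bucket = firstHash % tableSize;
        while (true)
        {
            yield return bucket;
            bucket = (bucket + step) % tableSize; // jump by the second hash on each collision
        }
    }
}
```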
The prime they picked was 101. So if they allowed table sizes of the form 101*N + 1, they could potentially get a step size equal to the table size. That would mean the next bucket probed for the key is the same as the primary one, which would collide again and cause an infinite loop.
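Plugging concrete numbers into that simplified formula (my own example):

```csharp
using System;

int tableSize = 607;                          // 607 = 101 * 6 + 1, one of the skipped primes
int firstHash = 6;                            // a key whose first hash happens to be 6
int step = (firstHash * 101 + 1) % tableSize; // (6 * 101 + 1) % 607 = 607 % 607
Console.WriteLine(step);                      // 0 -> every probe lands back on the same bucket
```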
At least that's the one failure scenario I could understand; maybe there are more. In any case, the new hash table implementations don't reference the 101 constant anymore, so this doesn't seem to be a real limitation today. I assume they still avoid 101*N + 1 for historical reasons only.