What are the best prime numbers to choose for a double hashed hash table size?
side info
- the hash table is part of a word analysis project, Markov models, training bots to model and generate text as if someone else would write it (which takes a lot of words, sentences, transcripts, books... the bigger the corpus, the better)
- I'm not familiar with most of the math around prime numbers but I will read on everything you guys propose and then try to go from there
what I have in mind:
- the prime numbers shouldn't be too far/close to each other ----> I don't have to increase the size frequently, but the hash table doesn't end up half empty (less collisions, looking for ideal ratio between load factor and hash table size)
- optimal for a big corpus - I'm not sure how big the prime numbers I have to choose should be, never did this before...
- I also thought of implementing a function (not a hash function) that'd just double the size of the hash table and then look for the closest prime number ------>
but that has a running time of O(n) because a prime is only divisible by itself ____( I have to check whether all the numbers up to the number that's double the size of the current hash table size have the remainder other than zero, then increment the size by one/go to the next odd number and test the whole loop again)________ ------> you can imagine that that would be very slow so the better approach is just to have a fixed set of prime numbers up to a million (just for illustration purposes) or so and then just use these for any size changes
Thanks, any additional questions appreciated
Choose high of twin prime numbers, i. e. when p
and p - 2
are primes, choose p
as double hash capacity, because hash_code % (size - 2)
is a good secondary step function for double hashing algorithm, and modulo prime number is somewhat more "robust" than modulo composite number (if size - 2
is composite).
For small sizes (somewhere around 1000 or so) choose all primes, except low ones of twin pairs, because twin pairs are too rare in the beginning of natural numbers scale, for good size predictability.
Add sizes of 5 and 11 (though they are low in twin primes) to better address very small table sizes.
Exclude numbers that are frequently used in multiplication hash functions, in Java it is 31
that is used in String
hash function, I don't know about Python.
All the above is carefully coded in this Java runnable, with a lot of pre-generated table sizes (trying to keep 0.005 max difference between neighbouring table sizes):
https://github.com/OpenHFT/Koloboke/blob/0498951705b45be2e1528afd786c03308c36e5dc/lib/impl/src/main/java/net/openhft/koloboke/collect/impl/hash/DHashCapacities.java#L255-L272
P. S. My personal belief is that double hashing is never an optimal open addressing flavor, because of modulo operations which are disproportionately expensive in modern CPUs. Consider using QHash.