Search code examples
calgorithmhashtable

Spread index for hash table implementation


I use a hash table as the main data structure for a project in C. I use this implementation of SipHash as the hash function. I use chained lists to handle collisions. The index for a given element to be inserted or searched in the hash table is computed as follows:

size_t genindex(const char *key, const size_t ht_len, const uint8_t *k) {
    uint64_t hash = siphash((const uint8_t *)key, strlen(key), k);
    size_t index = (size_t)(hash & (uint64_t) (ht_len - 1));
    return index;
}

This approach creates a lot of collisions as about 60% of my table is unused. The data I store in the hash table is relatively small (about 60 elements). I could use another collision handling method, but this requires a lot more work and I will consider it if I fail to find a solution. I think I should modify the index resize method.

Are there other method to resize the hash to fit it to the hash table size ht_len than using a modulo operation ?

Note: the table size is fixed at its initialization. It is generated as the closest prime number to the data size and its chosen between 2^i and 2^(i+1), the powers of two lower and greater than the data size.


Solution

  • As others have commented, using a hash table seems overly complicated for the purpose.

    However, the problem with the insufficient spreading is probably due to this line:

        size_t index = (size_t)(hash & (uint64_t) (ht_len - 1));
    

    which is supposed to calculate "modulo ht_len". However, using & for modulo only works if ht_len is a power of two.

    Doing & with ht_len-1 will generally spread the result over a set of indices of 2^n values where "n" is the number of ones in the binary representation of ht_len-1. Or put in another way: all indices, i, not fulfilling i & (ht_len-1) == i will be excluded.

    For example, consider the values 0, ..., 15. If ht_len is 8, we AND with 8-1=7 and get:

        0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7
    

    (i.e. the results are spread over the 8 possible remainders).

    If ht_len is 7, we AND with 7-1=6 and get:

        0, 0, 2, 2, 4, 4, 6, 6, 0, 0, 2, 2, 4, 4, 6, 6
    

    (i.e. only 4 results are possible).

    Doing actual modulo 7 would give:

        0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1
    

    So in short, the mentioned line should be changed to calculate modulo correctly:

        size_t index = (size_t)(hash % (uint64_t) (ht_len));