Search code examples
calgorithmhashtableprobing

Robin Hood hashing in C


I'm implementing a Hashtable that handles collisions with robin hood hashing. However, previously i had chaining instead, and the process of inserting almost 1 million keys was pretty much instantaneous. The same doesn't happen with the Robin Hood hashing which i found strange since i had the impression it was much quicker. So what i want to ask is if my insertion function is properly implemented. Here's the code:

typedef struct hashNode{
    char *word;
    int freq; //not utilized in the insertion
    int probe;// distance from the calculated index to the actual index it was inserted.
    struct hashNode* next; //not utilized in the insertion
    struct hashNode* base; //not utilized in the insertion
}hashNode;
typedef hashNode* hash_ptr;

hash_ptr hashTable[NUM_WORDS] = {NULL}; // NUM_WORDS = 1000000
                                        // Number of actual entries = 994707

hash_ptr swap(hash_ptr node, int index){
    hash_ptr temp = hashTable[index];
    hashTable[index] = node;
    return temp;
}

static void insertion(hash_ptr node,int index){
    while(hashTable[index]){
        if(node->probe > hashTable[index]->probe){
            node = swap(node,index);
        }
        node->probe++;
        index++;
        if(index > NUM_WORDS) index = 0;

    }
    hashTable[index] = node;
}

To contextualize everything:

  • the node parameter is the new entry.
  • the index parameter is where the new entry will be, if it isn't occupied.

Solution

  • The Robin Hood algorithm is very clever but it is just as dependent on having a good hash function as is any other open hashing technique.

    As a worst case, consider the worst possible hash function:

    int hash(const char* key) { return 0; }
    

    Since this will map every item to the same slot, it is easy to see that the total number of probes is quadratic in the number of entries: the first insert succeeds on the first probe; the second insert requires two probes; the third one three probes; and so on, leading to a total of n(n+1)/2 probes for n inserts. This is true whether you use simple linear probing or Robin Hood probing.

    Interestingly, this hash function might have no impact whatsoever on insertion into a chained hash table if -- and this is a very big if -- no attempt is made to verify that the inserted element is unique. (This is the case in the code you present, and it's not totally unreasonable; it is quite possible that the hash table is being built as a fixed lookup table and it is already known that the entries to be added are unique. More on this point later.)

    In the chained hash implementation, the non-verifying insert function might look like this:

    void insert(hashNode *node, int index) {
      node->next = hashTable[index];
      hashTable[index] = node;
    }
    

    Note that there is no good reason to use a doubly-linked list for a hash chain, even if you are planning to implement deletion. The extra link is just a waste of memory and cycles.

    The fact that you can build the chained hash table in (practically) no time at all does not imply that the algorithm has built a good hash table. When it comes time to look a value up in the table, the problem will be discovered: the average number of probes to find the element will be half the number of elements in the table. The Robin Hood (or linear) open-addressed hash table has exactly the same performance, because all searches start at the beginning of the table. The fact that the open-addressed hash table was also slow to build is probably almost irrelevant compared to the cost of using the table.

    We don't need a hash function quite as terrible as the "always use 0" function to produce quadratic performance. It's sufficient for the hash function to have an extremely small range of possible values (compared with the size of the hash table). If the possible values are equally likely, the chained hash will still be quadratic but the average chain length will be divided by the number of possible values. That's not the case for the linear/R.Hood probed hash, though, particularly if all the possible hash values are concentrated in a small range. Suppose, for example, the hash function is

    int hash(const char* key) {
      unsigned char h = 0;
      while (*key) h += *key++;
      return h;
    }
    

    Here, the range of the hash is limited to [0, 255). If the table size is much larger than 256, this will rapidly reduce to the same situation as the constant hash function. Very soon the first 256 entries in the hash table will be filled, and every insert (or lookup) after that point will require a linear search over a linearly-increasing compact range at the beginning of the table. So the performance will be indistinguishable from the performance of the table with a constant hash function.

    None of this is intended to motivate the use of chained hash tables. Rather, it is pointing to the importance of using a good hash function. (Good in the sense that the result of hashing a key is uniformly distributed over the entire range of possible node positions.) Nonetheless, it is generally the case that clever open-addressing schemes are more sensitive to bad hash functions than simple chaining.

    Open-addressing schemes are definitely attractive, particularly in the case of static lookup tables. They are more attractive in the case of static lookup tables because deletion can really be a pain, so not having to implement key deletion removes a huge complication. The most common solution for deletion is to replace the deleted element with a DELETED marker element. Lookup probes must still skip over the DELETED markers, but if the lookup is going to be followed by an insertion, the first DELETED marker can be remembered during the lookup scan, and overwritten by the inserted node if the key is not found. That works acceptably, but the load factor has to be calculated with the expected number of DELETED markers, and if the usage pattern sometimes successively deletes a lot of elements, the real load factor for the table will go down significantly.

    In the case where deletion is not an issue, though, open-addressed hash tables have some important advantages. In particular, they are much lower overhead in the case that the payload (the key and associated value, if any) is small. In the case of a chained hash table, every node must contain a next pointer, and the hash table index must be a vector of pointers to node chains. For a hash table whose key occupies only the space of a pointer, the overhead is 100%, which means that a linear-probed open-addressed hash table with a load factor of 50% occupies a little less space than a chained table whose index vector is fully occupied and whose nodes are allocated on demand.

    Not only is the linear probed table more storage efficient, it also provides much better reference locality, which means that the CPU's RAM caches will be used to much greater advantage. With linear probing, it might be possible to do eight probes using a single cacheline (and thus only one slow memory reference), which could be almost eight times as fast as probing through a linked list of randomly allocated table entries. (In practice, the speed up won't be this extreme, but it could well be more than twice as fast.) For string keys in cases where performance really matters, you might think about storing the length and/or the first few characters of the key in the hash entry itself, so that the pointer to the full character string is mostly only used once, to verify the successful probe.

    But both the space and time benefits of open addressing are dependent on the hash table being an array of entries, not an array of pointers to entries as in your implementation. Putting the entries directly into the hash index avoids the possibly-significant overhead of a pointer per entry (or at least per chain), and permits the efficient use of memory caches. So that's something you might want to think about in your final implementation.

    Finally, it's not necessarily the case that open addressing makes deletion complicated. In a cuckoo hash (and the various algorithms which it has inspired in recent years), deletion is no more difficult than deletion in a chained hash, and possibly even easier. In a cuckoo hash, any given key can only be in one of two places in the table (or, in some variants, one of k places for some small constant k) and a lookup operation only needs to examine those two places. (Insertion can take longer, but it is still expected O(1) for load factors less than 50%.) So you can delete an entry simply by removing it from where it is; that will have no noticeable effect on lookup/insertion speed, and the slot will be transparently reused without any further intervention being necessary. (On the down side, the two possible locations for a node are not adjacent and they are likely to be on separate cache lines. But there are only two of them for a given lookup. Some variations have better locality of reference.)


    A few last comments on your Robin Hood implementation:

    1. I'm not totally convinced that a 99.5% load factor is reasonable. Maybe it's OK, but the difference between 99% and 99.5% is so tiny that there is no obvious reason to tempt fate. Also, the rather slow remainder operation during the hash computation could be eliminated by making the size of the table a power of two (in this case 1,048,576) and computing the remainder with a bit mask. The end result might well be noticeably faster.

    2. Caching the probe count in the hash entry does work (in spite of my earlier doubts) but I still believe that the suggested approach of caching the hash value instead is superior. You can easily compute the probe distance; it's the difference between the current index in the search loop and the index computed from the cached hash value (or the cached starting index location itself, if that's what you choose to cache). That computation does not require any modification to the hash table entry, so it's cache friendlier and slightly faster, and it doesn't require any more space. (But either way, there is a storage overhead, which also reduces cache friendliness.)

    3. Finally, as noted in a comment, you have an off-by-one error in your wraparound code; it should be

      if(index >= NUM_WORDS) index = 0;
      

      With the strict greater-than test as written, your next iteration will try to use the entry at index NUM_WORDS, which is out of bounds.