Understanding Consistent Hashing

I've been looking at consistent hashing algorithms for PHP over the past couple of days. I wish to gain a better understanding of how consistent hashing actually works, so I can utilize it on some upcoming projects. It appears to me that Flexihash is really the only pure-PHP implementation out there that was easy to understand, so I took some notes from it.

I have created a small algorithm of my own to attempt to understand how it works, and how to make it work as fast as possible. I was surprised at how fast my algorithm was compared to Flexihash, which makes me wonder if my implementation is flawed in some way, or perhaps I'm not grasping a crucial piece of the entire concept.

The speed differences are listed below, over an iteration of 1 million sequential keys (0 to 1,000,000). Each node is displayed to show how many keys actually hashed to that particular node.

Flexihash:
 Time: 269 seconds
Speed: 3714 hashes/sec

 node1: 80729
 node2: 88390
 node3: 103623
 node4: 112338
 node5: 79893
 node6: 85377
 node7: 80966
 node8: 134462
 node9: 117046
node10: 117176

My implementation:
 Time: 3 seconds
Speed: 265589 hashes/sec

 node1: 100152
 node2: 100018
 node3: 99884
 node4: 99889
 node5: 100057
 node6: 100030
 node7: 99948
 node8: 99918
 node9: 100011
node10: 100093

Here is my current implementation of the hashing algorithm.

class Base_Hasher_Consistent
{
    protected $_nodes;

    public function __construct($nodes=NULL)
    {
        $this->_nodes = array();

        $node_count = count($nodes);
        $hashes_per_node = (int)(PHP_INT_MAX / $node_count);

        $old_hash_count = 0;

        foreach($nodes as $node){
            if(!($node == $nodes[$node_count-1])){
                $this->_nodes[] = array(
                    'name' => $node,
                    'begin' => $old_hash_count,
                    'end' => $old_hash_count + $hashes_per_node - 1
                );

                $old_hash_count += $hashes_per_node;
            }else{
                $this->_nodes[] = array(
                    'name' => $node,
                    'begin' => $old_hash_count,
                    'end' => PHP_INT_MAX
                );
            }
        }
    }

    public function hashToNode($data_key,$count=0)
    {
        $hash_code = (int)abs(crc32($data_key));

        foreach($this->_nodes as $node){
            if($hash_code >= $node['begin']){
                if($hash_code <= $node['end']){
                    return($node['name']);
                }
            }
        }
    }
}

Am I missing something, or is the algorithm really just faster than Flexihash? Also, I do understand that Flexihash supports looking up multiple nodes, so I'm not sure if that has anything to do with it.

I would just like some reassurance that I understand how consistent hashing works, or maybe links to some articles that really explain it well.

Thanks!

Solution

So do you want to know how crc32 works or are you just wondering how to write a good 'bucket' implementation?

Your bucket implementation looks fine. You could probably do it faster if you just did:

$bucket_index = floor($hash_code / $hashes_per_node);
return $this->_nodes[$bucket_index]['name'];

The 'algorithm' your wrote just makes $nodes number of buckets and figures out in which of those buckets a $data_key should go. The actual hashing algorithm you're using is crc32, which may not be an ideal algorithm if you're doing buckets.

If you want to know how crc32 works and why the hash is consistent .. look it up on wikipedia or something. To my knowledge there is no such thing as an inconsistent hash, so all hashing algorithms are by definition consistent.

One characteristic of a hashing algorithm is that it can generate hashes that are very dissimilar for data_keys that are similar. This is true for crc32 since crc32 is intended to detect minor changes. What it does not guarantee though is that the resulting hashes are 'well spread'. Since you're doing buckets you want a hashing algorithm that has the specific property that it generates hashes all over the spectrum. For crc32 it may just coincidentally work really well. I'd just use md5.