I've been looking at consistent hashing algorithms for PHP over the past couple of days. I wish to gain a better understanding of how consistent hashing actually works, so I can utilize it on some upcoming projects. It appears to me that Flexihash is really the only pure-PHP implementation out there that was easy to understand, so I took some notes from it.
I have created a small algorithm of my own to attempt to understand how it works, and how to make it work as fast as possible. I was surprised at how fast my algorithm was compared to Flexihash, which makes me wonder if my implementation is flawed in some way, or perhaps I'm not grasping a crucial piece of the entire concept.
The speed differences are listed below, over an iteration of 1 million sequential keys (0 to 1,000,000). Each node is displayed to show how many keys actually hashed to that particular node.
Flexihash:
Time: 269 seconds
Speed: 3714 hashes/sec
node1: 80729
node2: 88390
node3: 103623
node4: 112338
node5: 79893
node6: 85377
node7: 80966
node8: 134462
node9: 117046
node10: 117176
My implementation:
Time: 3 seconds
Speed: 265589 hashes/sec
node1: 100152
node2: 100018
node3: 99884
node4: 99889
node5: 100057
node6: 100030
node7: 99948
node8: 99918
node9: 100011
node10: 100093
Here is my current implementation of the hashing algorithm.
class Base_Hasher_Consistent
{
protected $_nodes;
public function __construct($nodes=NULL)
{
$this->_nodes = array();
$node_count = count($nodes);
$hashes_per_node = (int)(PHP_INT_MAX / $node_count);
$old_hash_count = 0;
foreach($nodes as $node){
if(!($node == $nodes[$node_count-1])){
$this->_nodes[] = array(
'name' => $node,
'begin' => $old_hash_count,
'end' => $old_hash_count + $hashes_per_node - 1
);
$old_hash_count += $hashes_per_node;
}else{
$this->_nodes[] = array(
'name' => $node,
'begin' => $old_hash_count,
'end' => PHP_INT_MAX
);
}
}
}
public function hashToNode($data_key,$count=0)
{
$hash_code = (int)abs(crc32($data_key));
foreach($this->_nodes as $node){
if($hash_code >= $node['begin']){
if($hash_code <= $node['end']){
return($node['name']);
}
}
}
}
}
Am I missing something, or is the algorithm really just faster than Flexihash? Also, I do understand that Flexihash supports looking up multiple nodes, so I'm not sure if that has anything to do with it.
I would just like some reassurance that I understand how consistent hashing works, or maybe links to some articles that really explain it well.
Thanks!
So do you want to know how crc32 works or are you just wondering how to write a good 'bucket' implementation?
Your bucket implementation looks fine. You could probably do it faster if you just did:
$bucket_index = floor($hash_code / $hashes_per_node);
return $this->_nodes[$bucket_index]['name'];
The 'algorithm' your wrote just makes $nodes
number of buckets and figures out in which of those buckets a $data_key
should go. The actual hashing algorithm you're using is crc32, which may not be an ideal algorithm if you're doing buckets.
If you want to know how crc32 works and why the hash is consistent .. look it up on wikipedia or something. To my knowledge there is no such thing as an inconsistent hash, so all hashing algorithms are by definition consistent.
One characteristic of a hashing algorithm is that it can generate hashes that are very dissimilar for data_keys that are similar. This is true for crc32 since crc32 is intended to detect minor changes. What it does not guarantee though is that the resulting hashes are 'well spread'. Since you're doing buckets you want a hashing algorithm that has the specific property that it generates hashes all over the spectrum. For crc32 it may just coincidentally work really well. I'd just use md5.