I'm trying to implement the djb2 hash in Python.
Here it is in C:
/* djb2 hash http://www.cse.yorku.ca/~oz/hash.html */
uint64_t djb2(size_t len, char const str[len]) {
uint64_t hash = 5381;
uint8_t c;
for(size_t i = 0; i < len; i++) {
c = str[i];
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
}
return hash;
}
And here's my attempt in Python:
from ctypes import c_uint64, c_byte, cast, POINTER
def djb2(string: str) -> c_uint64:
hash = c_uint64(5381)
raw_bytes = cast(string, POINTER(c_byte * len(string)))[0]
for i in range(0, len(raw_bytes)):
hash = c_uint64((((((hash.value << 5) & 0xffffffffffffffff) + hash.value) & 0xffffffffffffffff) + raw_bytes[i]) & 0xffffffffffffffff) # hash * 33 + c
return hash
However, I'm getting different results between the two, which I suspect is because of different overflow behavior, or otherwise mathematical differences.
The reason for the masking in the python version was to attempt to force an overflow (based on this answer).
You can implement the algorithm being run by the C code very easily in pure Python, without needing any ctypes
stuff. Just do it all with regular Python integers, and take a modulus at the end (the high bits won't effect the lower ones for the operations you're doing):
def djb2(string: bytes) -> int: # note, use a bytestring for this, not a Unicode string!
h = 5381
for c in string: # iterating over the bytestring directly gives integer values
h = h * 33 + c # use the computation from the C comments, but consider ^ instead of +
return h % 2**64 # note you may actually want % 2**32, as this hash is often 32-bit
As I commented in the code, since this is an operation defined on bytestrings, you should use a bytes
instance as the argument. Note that there are a bunch of different implementations of this algorithm. Some use use ^
(bitwise xor) instead of +
in the step where you update the hash value, and it's often defined to use an unsigned long
which was usually 32-bits instead of the explicitly 64-bit integer the C version in your question uses.