python c algorithm hash code-translation

Translate a hashing algorithm from C to Python

My client is a Python programmer and I have created a C++ backend for him which includes license generation and checking. For additional safety, the Python front-end will also perform a validity check of the license.

The license generation and checking algorithm however is based on hashing methods which rely on the fact that an integer is of a fixed byte size and bit-shifting a value will not extend the integers byte count.

This is a simplified example code:

unsigned int HashString(const char* str) {
    unsigned int hash = 3151;
    while (*str != 0) {
        hash = (hash << 3) + (*str << 2) * 3;
        str++;
    }
    return hash;
}

How can this be translated to Python? The direct translation obviously yields a different result:

def hash_string(str):
    hash = 3151
    for c in str:
        hash = (hash << 3) + (ord(c) << 2) * 3
    return hash

For instance:

hash_string("foo bar spam")  #  228667414299004
HashString("foo bar spam")   // 3355459964

Edit: The same would also be necessary for PHP since the online shop should be able to generate valid licenses, too.

Solution

The issue here is that C's unsigned int automatically rolls over when it goes past UINT_MAX, while Python's int just keeps getting bigger.

The easiest fix is just to correct at the end:

return hash % (1 << 32)

For very large strings, it maybe a little faster to mask after each operation, to avoid ending up with humongous int values that are slow to work with. But for smaller strings, that will probably be slower, because the cost of calling % 12 times instead of 1 will easily outweigh the cost of dealing with a 48-bit int.

PHP may have the same problem, or a different one.

PHP's default integer type is a C long. On a 64-bit Unix platform, this is bigger than an unsigned int, so you will have to use the same trick as on Python (either % or &, whichever makes more sense to you.)

But on a 32-bit Unix platform, or on Windows, this is the same size as unsigned int but signed, which means you need a different trick. You can't actually represent, say, 4294967293 directly (try it, and you'll get -3 instead). You can use a GMP or BCMath integer instead of the default type (in which case it's basically the same as in Python), or you can just write custom code for printing, comparing, etc. that will treat that -3 as if it were 4294967293.

Note that I'm just assuming that int is 32 bits, and long is either 32 or 64, because that happens to be true on every popular platform today. But the C standard only requires that int be at least 16 bits long, and long be at least 32 bits and no shorter than int. If you need to deal with very old platforms where int might be 16 bits (or 18!), or future platforms where it might be 64 or more, you have to adjust your code appropriately.