hashpolynomial-mathmodular-arithmeticrabin-karp

Why is the following code correct for computing the hash of a string?


I am currently reading about the Rabin Karp algorithm and as part of that I need to understand string polynomial hashing. From what I understand, the hash of a string is given by the following formula:

hash = ( char_0_val * p^0 + char_1_val * p^1 + ... + char_n_val ^ p^n ) mod m

Where:

  • char_i_val: is the integer value of the character plus 1 given by string[i]-'a' + 1
  • p is a prime number larger than the character set
  • m is a large prime number

The website cp-algorithms has the following entry on the subject. They say that the code to write the above is as follows:

long long compute_hash(string const& s) {
    const int p = 31;
    const int m = 1e9 + 9;
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}

I understand what the program is trying to do but I do not understand why it is correct.

My question

I am having trouble understanding why the above code is correct. It has been a long time since I have done any modular math. After searching online I see that we have the following formulas for modular addition and modular multiplication:

a+b (mod m) = (a%m + b%m)%m
a*b (mod m) = (a%m * b%m)%m

Based on the above shouldn't the code be as follows?

long long compute_hash(string const& s) {
    const int p = 31;
    const int m = 1e9 + 9;
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s) {
        int char_value = (c - 'a' + 1);
        hash_value = (hash_value%m + ((char_value%m * p_pow%m)%m)%m ) % m;
        p_pow = (p_pow%m * p%m) % m;
    }
    return hash_value;
}

What am I missing? Ideally I am seeking a breakdown of the code and an explanation of why the first version is correct.


Solution

  • Mathematically, there is no reason to reduce intermediate results modulo m.

    Operationally, there are a couple of very closely related reasons to do it:

    1. To keep numbers small enough that they can be represented efficiently.
    2. To keep numbers small enough that operations on them do not overflow.

    So let's look at some quantities and see if they need to be reduced.

    • p was defined as some value less than m, so p % m == p.
    • p_pow and hash_value have already been reduced modulo m when they were computed, reducing them modulo m again would do nothing.
    • char_value is at most 26, which is already less than m.
    • char_value * p_pow is at most 26 * (m - 1). That can be, and often will be, more than m. So reducing it modulo m would do something. But it can still be delayed, because the next step is still "safe" (no overflow)
    • char_value * p_pow + hash_value is still at most 27 * (m - 1) which is still much less than 263-1 (the maximum value for a long long, see below why I assume that a long long is 64-bit), so there is no problem yet. It's fine to reduce modulo m after the addition.

    As a bonus, the loop could actually do (263-1) / (27 * (m - 1)) iterations before it needs to reduce hash_value modulo m. That's over 341 million iterations! For most practical purposes you could therefore remove the first % m and return hash_value % m; instead.

    I used 263-1 in this calculation because p_pow = (p_pow * p) % m requires long long to be a 64-bit type (or, hypothetically, an exotic size of 36 bits or higher). If it was a 32-bit type (which is technically allowed, but rare nowadays) then the multiplication could overflow, because p_pow can be approximately a billion and a 32-bit type cannot hold 31 billion.

    BTW note that this hash function is specifically for strings that only contain lower-case letters and nothing else. Other characters could result in a negative value for char_value which is bad news because the remainder operator % in C++ works in a way such that for negative numbers it is not the "modulo operator" (misnomer, and the C++ specification does not call it that). A very similar function can be written that can take any string as input, and that would change the analysis above a little bit, but not qualitatively.