hashpolynomial-mathmodular-arithmeticrabin-karp# Why is the following code correct for computing the hash of a string?

## My question

I am currently reading about the Rabin Karp algorithm and as part of that I need to understand string polynomial hashing. From what I understand, the hash of a string is given by the following formula:

```
hash = ( char_0_val * p^0 + char_1_val * p^1 + ... + char_n_val ^ p^n ) mod m
```

Where:

- char_i_val: is the integer value of the character plus 1 given by
`string[i]-'a' + 1`

- p is a prime number larger than the character set
- m is a large prime number

The website cp-algorithms has the following entry on the subject. They say that the code to write the above is as follows:

```
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
p_pow = (p_pow * p) % m;
}
return hash_value;
}
```

I understand what the program is trying to do but I do not understand why it is correct.

I am having trouble understanding why the above code is correct. It has been a long time since I have done any modular math. After searching online I see that we have the following formulas for modular addition and modular multiplication:

```
a+b (mod m) = (a%m + b%m)%m
a*b (mod m) = (a%m * b%m)%m
```

Based on the above shouldn't the code be as follows?

```
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
int char_value = (c - 'a' + 1);
hash_value = (hash_value%m + ((char_value%m * p_pow%m)%m)%m ) % m;
p_pow = (p_pow%m * p%m) % m;
}
return hash_value;
}
```

What am I missing? Ideally I am seeking a breakdown of the code and an explanation of why the first version is correct.

Solution

Mathematically, there is no reason to reduce intermediate results modulo `m`

.

Operationally, there are a couple of very closely related reasons to do it:

- To keep numbers small enough that they can be represented efficiently.
- To keep numbers small enough that operations on them do not overflow.

So let's look at some quantities and see if they need to be reduced.

`p`

was defined as some value less than`m`

, so`p % m == p`

.`p_pow`

and`hash_value`

have already been reduced modulo`m`

when they were computed, reducing them modulo`m`

again would do nothing.`char_value`

is at most 26, which is already less than`m`

.`char_value * p_pow`

is at most`26 * (m - 1)`

. That can be, and often will be, more than`m`

. So reducing it modulo`m`

would do something. But it can still be delayed, because the next step is still "safe" (no overflow)`char_value * p_pow + hash_value`

is still at most`27 * (m - 1)`

which is still much less than 2^{63}-1 (the maximum value for a`long long`

, see below why I assume that a`long long`

is 64-bit), so there is no problem yet. It's fine to reduce modulo`m`

*after*the addition.

As a bonus, the loop could actually do (2^{63}-1) / (27 * (m - 1)) iterations before it needs to reduce `hash_value`

modulo `m`

. That's over 341 million iterations! For most practical purposes you could therefore remove the first `% m`

and `return hash_value % m;`

instead.

I used 2^{63}-1 in this calculation because `p_pow = (p_pow * p) % m`

requires `long long`

to be a 64-bit type (or, hypothetically, an exotic size of 36 bits or higher). If it was a 32-bit type (which is technically allowed, but rare nowadays) then the multiplication could overflow, because `p_pow`

can be approximately a billion and a 32-bit type cannot hold 31 billion.

BTW note that this hash function is specifically for strings that only contain lower-case letters and nothing else. Other characters could result in a negative value for `char_value`

which is bad news because the remainder operator `%`

in C++ works in a way such that for negative numbers it is not the "modulo operator" (misnomer, and the C++ specification does not call it that). A very similar function can be written that can take any string as input, and that would change the analysis above a little bit, but not qualitatively.

- Does anyone know how to calculate Authenticode File hash in PowerShell?
- Hash input field data with PHP
- Is it worth hashing passwords on the client side
- hash function for string
- php - Is strpos the fastest way to search for a string in a large body of text?
- How to generate a Hash or checksum value on Python Dataframe (created from a fixed width file)?
- Windows Equivalent to `sha256sum -c` (cryptographic hash, digest file, recursive integrity check, SHA256SUMS)
- Know git hash before committing?
- How do I map an array of hashes?
- How can I hash a string with SHA256?
- Unique identifier for ratios with tolerance
- How can I calculate a hash for a filesystem-directory using Python?
- Salt and hash a password in Python
- How do I replicate PHP's hash_hmac function in C++?
- php mysqli_connect: authentication method unknown to the client [caching_sha2_password]
- PDFBox - Signing a PDF using an external service (document has been altered or corrupted)
- Generate a Hash from string in Javascript
- How do I make a python dataclass inherit __hash__?
- Hash passwords in md5 with symfony 6.2
- Encoding Spotify URI to Spotify Codes
- Unable to download signingReport for SHA1 and SHA-256
- Prevent multiple form submissions in Django
- How to find the most frequent element in a stream of numbers with additions and deletions allowed
- Convert flat array [k1,v1,k2,v2] to object {k1:v1,k2:v2} in JavaScript?
- Strange, unexpected behavior (disappearing/changing values) when using Hash default value, e.g. Hash.new([])
- Will a git commit have the same hash after reverting previous commit?
- What hash GCP uses to derive the ID of a service account's uploaded public key?
- Using Onvif PTZ on IP camera with digest in C#
- Why am I getting different hash for duplicated files?
- HashSet vs. List performance