Search code examples
javastringalgorithmrabin-karp

Rolling hash overflow/negative result protection


This question is very similar to rolling-hash, but there are some specifics regarding overflow/negative result which are still not clear to me.

I have as well checked out this Rabin-Karp implementation and have issues with the line bellow:

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;

I understand that the following expression might give negative result:

txtHash - RM*txt.charAt(i-M)

First question:

  • if we always add Q, a large prime, can this result with negative number due to overflow ?
    • If not, why not ? If yes, shouldn't this addition be done only if result is negative ?

Second question:

If, for a moment, we didn't care about negative numbers, would it be correct to write expression bellow ?

txtHash = (txtHash - RM*txt.charAt(i-M)) % Q;

Third question, this part confuses me most:

Lets assume that the overflow cannot happen when we add Q. Why is there left-most % Q operation over the leading digit ?

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q ) % Q;

I have read the answer i have linked and according to the answer by Aneesh, and if i understood correctly expressions bellow should be similar:

hash = hash - ((5 % p)*(10^2 %p) %p)

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;

But i don't see why they are similar because in example with hash, % p is not calculated for previous hash value, however for txtHash we calculate % Q over the previous hash as well.


Solution

  • First question:

    if we always add Q, a large prime, can this result with negative number due to overflow ? If not, why not ? If yes, shouldn't this addition be done only if result is negative ?

    The prime number Q is usually chosen so that 2Q still does not overflow the type.

    Now let's see.

    • txtHash is from 0 to Q - 1.
    • RM*txt.charAt(i-M) is large.
    • RM*txt.charAt(i-M) % Q is from 0 to Q - 1.
    • txtHash - RM*txt.charAt(i-M) % Q is from -(Q - 1) to Q - 1.
    • txtHash + Q - RM*txt.charAt(i-M) % Q is from 1 to 2Q - 1.

    So, as long as 2Q - 1 does not overflow, the above expression is fine.

    Second question:

    If, for a moment, we didn't care about negative numbers, would it be correct to write expression bellow ?

    txtHash = (txtHash - RM*txt.charAt(i-M)) % Q;

    Yeah, if the % Q always gave a result from 0 to Q-1 (as it does in Python for example), the above expression would be fine.

    Third question, this part confuses me most:

    Lets assume that the overflow cannot happen when we add Q. Why is there left-most % Q operation over the leading digit ?

    txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q ) % Q;

    Suppose we remove the leftmost % Q. Then let us estimate again:

    • txtHash is from 0 to Q - 1.
    • RM*txt.charAt(i-M) is large.
    • How large? From 0 to (Q - 1) * CharCode.
    • txtHash - RM*txt.charAt(i-M) is from -(Q - 1) * (CharCode - 1) to Q - 1.
    • txtHash + Q - RM*txt.charAt(i-M) is from -(Q - 1) * (CharCode - 2) to 2Q - 1.

    Still possibly negative. Not what we wanted.