This question is very similar to rolling-hash, but there are some specifics regarding overflow/negative result which are still not clear to me.
I have as well checked out this Rabin-Karp implementation and have issues with the line bellow:
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
I understand that the following expression might give negative result:
txtHash - RM*txt.charAt(i-M)
First question:
Second question:
If, for a moment, we didn't care about negative numbers, would it be correct to write expression bellow ?
txtHash = (txtHash - RM*txt.charAt(i-M)) % Q;
Third question, this part confuses me most:
Lets assume that the overflow cannot happen when we add Q. Why is there left-most % Q operation over the leading digit ?
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q ) % Q;
I have read the answer i have linked and according to the answer by Aneesh, and if i understood correctly expressions bellow should be similar:
hash = hash - ((5 % p)*(10^2 %p) %p)
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
But i don't see why they are similar because in example with hash, % p is not calculated for previous hash value, however for txtHash we calculate % Q over the previous hash as well.
First question:
if we always add Q, a large prime, can this result with negative number due to overflow ? If not, why not ? If yes, shouldn't this addition be done only if result is negative ?
The prime number Q is usually chosen so that 2Q still does not overflow the type.
Now let's see.
txtHash
is from 0 to Q - 1.RM*txt.charAt(i-M)
is large.RM*txt.charAt(i-M) % Q
is from 0 to Q - 1.txtHash - RM*txt.charAt(i-M) % Q
is from -(Q - 1) to Q - 1.txtHash + Q - RM*txt.charAt(i-M) % Q
is from 1 to 2Q - 1.So, as long as 2Q - 1 does not overflow, the above expression is fine.
Second question:
If, for a moment, we didn't care about negative numbers, would it be correct to write expression bellow ?
txtHash = (txtHash - RM*txt.charAt(i-M)) % Q;
Yeah, if the % Q
always gave a result from 0 to Q-1 (as it does in Python for example), the above expression would be fine.
Third question, this part confuses me most:
Lets assume that the overflow cannot happen when we add Q. Why is there left-most % Q operation over the leading digit ?
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q ) % Q;
Suppose we remove the leftmost % Q
.
Then let us estimate again:
txtHash
is from 0 to Q - 1.RM*txt.charAt(i-M)
is large.txtHash - RM*txt.charAt(i-M)
is from -(Q - 1) * (CharCode - 1) to Q - 1.txtHash + Q - RM*txt.charAt(i-M)
is from -(Q - 1) * (CharCode - 2) to 2Q - 1.Still possibly negative. Not what we wanted.