I wrote a simple step-by-step implementation of Rabin-Karp algorithm for substring search, and it seems to work fine until the hash becomes greater than the modulus, and then it goes wrong...
Here is the code, it's quite simple:
typedef long long ll;
#define B 257
//base
#define M 2147483647
//modulus
//modulus for positive and negative values
ll mod(ll a){
return (a % M + M) % M;
}
//fast way to calculate modular power
ll power(ll n, ll e){
ll r = 1;
for(; e > 0; e >>= 1, n = (n*n) % M)
if(e&1) r = (r * n) % M;
return r;
}
//function to calculate de initial hash
//H(s) = s[0] * B^0 + s[1] * B^1 + ...
ll H(char sub[], int s){
ll h = 0;
for(ll i = 0; i < s; i++)
h = mod(h + mod(power(B, i) * sub[i]));
return h;
}
//brute force comparing when hashes match
bool check(char text[], char sub[], int ini, int s){
int i = 0;
while(text[ini + i] == sub[i] && i < s) i++;
return i == s;
}
//all together here
void RabinKarp(char text[], char sub[]){
int t = strlen(text), s = strlen(sub);
ll hs = H(sub, s), ht = H(text, s);
int lim = t - s;
for(int i = 0; i <= lim; i++){
if(ht == hs)
if(check(text, sub, i, s))
printf("MATCH AT %d\n", i);
ht -= text[i];
ht /= B;
ht = mod(ht + power(B, s - 1) * text[i + s]);
//we had text[i] * B^0 + text[i+1] * B^1 + ... + text[i + len - 1] * B^(len-1)
//then text[i+1] * B^1 + text[i+2] * B^2 + ... + text[i + len - 1] * B^(len-1)
//then text[i+1] * B^0 + text[i+2] * B^1 + ... + text[i + len - 1] * B^(len-2)
//finally we add a new last term text[i + len] * B^(len-1)
//so we moved the hash to the next position
}
}
int main(){
char text[] = "uvauvauvaaauva";
char sub[] = "uva";
char sub2[] = "uvauva";
RabinKarp(text, sub);
printf("----------------------------\n");
RabinKarp(text, sub2);
}
The problem is that after I take the modulus, the hash can become a small number and then, when I add some big factor to it, the hashes may not match even when they should.
For example: abc inside xabc
when I take the hash of abc and xab, suppose both of them are bigger than the modulus, so they get small after the modulus operation.
Then, when I remove 'x' and add the 'c' factor, the sum can be smaller than the modulus but still big, so it won't match.
How can I overcome this problem?
ht /= B; is not plausible. First of all because you are doing arithmetic mod M, and the modular equivalent of division is not the same as the standard one. Secondly because you should expect the same answer for x and x + M and this will not be the case.
You have text[i] * B^0 + text[i+1] * B^1 + ... + text[i + len - 1] * B^(len-1)
If you work with
text[i] * B^(len-1) + text[i+1] * B^(len - 2) + ... + text[i + len - 1] * B^0
You can subtract off text[i] * B^(len-1) and then multiply by B instead