Search code examples
c++algorithmmathhashrabin-karp

Prime Number and Block Length in Karp Rabin


I found a rabin karp code from that site and changed to try. Changed code is below. You can see words and their hash values in hashtable.txt. for below example hashtable.txt seems right.

But when I changed M (block length) to 150 I am getting wrong results. For example in hashtable.txt first line and 6th line must be same but their hash values are different.

Or when I changed q (prime number) to 683303 it's getting wrong results too.

What's the relation between prime number and block length in rabin karp algorithm, and what's reason of wrong results?

#include<stdio.h>
#include<string.h>
#include <fstream>
#include <iostream>
// d is the number of characters in input alphabet
#define d 256 
int M = 80;
/*  
txt  -> text
q    -> A prime number
*/
using namespace std;

void writeTable(char *txt, int q)
{
ofstream myfile;
myfile.open ("hashtable.txt");
int N = strlen(txt);
int i, j;
int t = 0; // hash value for txt
int h = 1;

// The value of h would be "pow(d, M-1)%q"
for (i = 0; i < M-1; i++)
    h = (h*d)%q;

// Calculate the hash value of pattern and first window of text
for (i = 0; i < M; i++)
{
    t = (d*t + txt[i])%q;
}

// Slide the pattern over text one by one 
for (i = 0; i <= N - M; i++)
{
    myfile << t <<" ";
    for (long z = i; z < M+i; z++){myfile<<txt[z];}myfile<<"\n";

    // Calulate hash value for next window of text: Remove leading digit, 
    // add trailing digit           
    if ( i < N-M )
    {
        t = (d*(t - txt[i]*h) + txt[i+M])%q;

        // We might get negative value of t, converting it to positive
        if(t < 0) 
          t = (t + q); 
    }
}

myfile.close();
}

/* Driver program to test above function */
int main()
{
    char *txt ="abcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcde";

int q = 683303;  // A prime number

writeTable(txt, q);

printf("finish");
getchar();
return 0;
}

Solution

  • The computation

    t = (d*(t - txt[i]*h) + txt[i+M])%q;
    

    can overflow. The maximal value of txt[i] is d-1, and h can be as large as q-1. So if (q-1)*(d-1)*d > INT_MAX, there is the possibility of integer overflow. That limits the size of the prime that can be safely chosen to INT_MAX/(d*(d-1)) + 1.

    If q is larger than that, that poses restrictions on the admissible values for M, namely M must be such that

    h <= INT_MAX/(d*(d-1))
    

    to safely prevent overflow.

    With q = 683303 and M = 80, you get h = 182084, and

    h*d*(d-1) = 182084 * 256 * 255 = 11886443520
    

    is larger than INT_MAX if int is 32 bits wide as it usually is.

    If your ints are 32 bits wide, you have overflow for the example from the beginning, because h*256*97 = 4521509888 > 2147483647.