Search code examples
javaalgorithmmodulorabin-karp

Rabin-Karp not working for large primes (gives wrong output)


So I was solving this problem (Rabin Karp's algorithm) and wrote this solution:

private static void searchPattern(String text, String pattern) {
    int txt_len = text.length(), pat_len = pattern.length();
    int hash_pat = 0, hash_txt = 0; // hash values for pattern and text's substrings
    final int mod = 100005;         // prime number to calculate modulo... larger modulo denominator reduces collisions in hash
    final int d = 256;              // to include all the ascii character codes
    int coeff = 1;                  // stores the multiplier (or coeffecient) for the first index of the sliding window

    /* 
     * HASHING PATTERN:
     * say text    = "abcd", then
     * hashed text = 256^3 *'a' + 256^2 *'b' + 256^1 *'c' + 256^0 *'d'
     */

    // The value of coeff would be "(d^(pat_len - 1)) % mod"
    for (int i = 0; i < pat_len - 1; i++)
        coeff = (coeff * d) % mod;

    // calculate hash of the first window and the pattern itself
    for (int i = 0; i < pat_len; i++) {
        hash_pat = (d * hash_pat + pattern.charAt(i)) % mod;
        hash_txt = (d * hash_txt + text.charAt(i)) % mod;
    }

    for (int i = 0; i < txt_len - pat_len; i++) {
        if (hash_txt == hash_pat) {
            // our chances of collisions are quite less (1/mod) so we dont need to recheck the substring
            System.out.println("Pattern found at index " + i);
        }
        hash_txt = (d * (hash_txt - text.charAt(i) * coeff) + text.charAt(i + pat_len)) % mod; // calculating next window (i+1 th index)

        // We might get negative value of t, converting it to positive
        if (hash_txt < 0)
            hash_txt = hash_txt + mod;
    }
    if (hash_txt == hash_pat) // checking for the last window
        System.out.println("Pattern found at index " + (txt_len - pat_len));
}

Now this code is simply not working if the mod = 1000000007, whereas as soon as we take some other prime number (large enough, like 1e5+7), the code magically starts working !

The line at which the code's logic failed is:

hash_txt = (d * (hash_txt - text.charAt(i) * coeff) + text.charAt(i + pat_len)) % mod;

Can someone please tell me why is this happening ??? Maybe this is a stupid doubt but I just do not understand.


Solution

  • In Java, an int is a 32-bit integer. If a calculation with such number mathematically yields a result that needs more binary digits, the extra digits are silently discarded. This is called overflow.

    To avoid this, the Rabin-Karp algorithm reduces results modulo some prime in each step, thereby keeping the number small enough that the next step will not overflow. For this to work, the prime chosen must be suitably small that

    d * (hash + max(char) * coeff) + max(char)) < max(int)
    

    Since

    0 ≤ hash < p,
    1 ≤ coeff < p,
    max(char) = 216
    max(int) = 231

    any prime smaller than 27=128 will do. For larger primes, it depends on what their coeff ends up being, but even if we select one with the smallest possible coeff = 1, the prime must not exceed 223, which is much smaller than the prime you used.

    In practice, one therefore uses Rabin-Karp with an integer datatype that is significantly bigger that the character type, such as a long (64 bits). Then, any prime < 239 will do.

    Even then, if it worth noting that your reasoning

    our chances of collisions are quite less (1/mod) so we dont need to recheck the substring

    is flawed, because the probability is determined not by chance, but by the strings being checked. Unless you know the probability distribution of your inputs, you can't know what the probability of failure is. That's why Rabin-Karp rechecks the string to make sure.