Search code examples
algorithmhashrabin-karp

Why do we need to check for a pattern match everytime the hash value is same in Rabin Karp algorithm


I don't see the reason why we need to check for a substring match every time the hash returns the same value for the pattern and the text. Isn't the hash value returned unique for a string?


Solution

  • The hash function that is used in the Rabin Karp algorithm is a "rolling hash" such as the Rabin Fingerprint, chosen because of its property that a hash can be easily computed based on the previous hash, not because of its collision resistance.

    In the Rabin Karp algorithm, we need to compute the hash of a sliding substring. Say e.g. that we're searching for a 24-character string in this text:

    "this is the text we are comparing"
    

    We would need to compute the hash for these substrings:

    "this is the text we are "
    "his is the text we are c"
    "is is the text we are co"
    "s is the text we are com"
    " is the text we are comp"
    "is the text we are compa"
    "s the text we are compar"
    " the text we are compari"
    "the text we are comparin"
    "he text we are comparing"
    

    So we choose a "rolling hash" function where, after the hash of the first substring is computed, we can compute the hash of the second substring using the first hash, the character that is removed from the substring, and the character that is added to it:

    "this is the text we are "  ->  hash1
    "his is the text we are c"  ->  hash1 -t +c  ->  hash2
    

    Such a "rolling hash" function isn't necessarily one for which finding two strings that have the same hash is only a remote possibility, as it would be in cryptographic hash functions. So the fact that the hash is the same doesn't guarantee that the substring is the same as the search string; therefor we need to do a full string compare to be sure.

    Note that any hash function which creates a hash that is shorter than the input will necessarily have collisions. And using a hash that is much shorter than the input string is the point of the Rabin Karp algorithm; comparing the hashes is much more efficient than comparing long strings.