Tags: algorithm, time-complexity, rabin-karp

Average case runtime of Rabin-Karp for long strings


I'm having trouble understanding why the average-case runtime of the Rabin-Karp algorithm is O(n + m), where n is the length of the input string and m is the length of the pattern string. My understanding is that this bound is supposed to hold for all n and m, even when both are very large.

Let our hash function return integers between 1 and k. Then the probability of a spurious collision at each position is 1/k, so we expect on average O(n/k) spurious collisions, in addition to one correct hash match (if there is a match). Thus we expect O(1 + n/k) hash matches. Each time we get a match, we have to check on average O(m) characters, so the total average runtime is O((1 + n/k) · m).
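As an aside (my own illustration, not part of the original question): a quick simulation makes the O(1 + n/k) estimate concrete. The function count_hash_matches and the k-bucket hash below are hypothetical stand-ins for a real rolling hash; with a pattern that never occurs in the text, essentially every hash match is a spurious collision, so the count should come out near n/k.

```python
import random

def count_hash_matches(text, pattern, k):
    """Count window positions whose toy k-bucket hash matches the
    pattern's hash: spurious collisions plus any true occurrences."""
    m = len(pattern)

    def bucket(s):
        return hash(s) % k   # stand-in for a real rolling hash into k buckets

    target = bucket(pattern)
    return sum(bucket(text[i:i + m]) == target
               for i in range(len(text) - m + 1))

random.seed(1)
n, k = 100_000, 1_000
text = ''.join(random.choice('abcdefghij') for _ in range(n))
pattern = 'zzzzz'   # deliberately absent from the text, so no true matches
print(count_hash_matches(text, pattern, k))   # roughly n/k = 100 matches
```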

For small n (n << k) this approximates the usual quoted average-case complexity of O(n + m). But for large n (n >> k), it is O((n/k) · m) = O(n · m) for fixed k. So the average-case complexity is O(n · m) for large n, which means the average runtime is theoretically O(n · m), not O(n + m). Of course, we are usually in the regime n << k, so we can expect O(n + m) in practice for short strings.

If we try to get around the collision problem by growing k as n gets bigger, so that the collision rate stays a low constant, then k will need to be O(n). But then the algorithm is still O(n · m) for large n, because hashing the pattern string will take O(k · m) = O(n · m).

My question is: where is the flaw in this analysis? Is Rabin-Karp actually average-case O(n · m) in theory? What am I missing?


Solution

  • First, your analysis is incorrect: the size of the hash is log(k) bits, not k.

    Second, the usual assumption is that the sizes of all the quantities involved fit into a constant number of machine words, and that you can do arithmetic on them in constant time.

    Since you only need k ~ n, the hash also fits into a machine word, and you can update the rolling hash in constant time. Hashing the pattern is therefore O(m) word operations, not O(k · m), which restores the O(n + m) average-case bound.
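To make the constant-time update concrete, here is a minimal Rabin-Karp sketch along the lines of the answer; it is my own illustration, and the names rabin_karp, base, and mod are assumptions. The modulus is a prime that fits in a 64-bit machine word, so each rolling-hash update is a constant number of word operations, and every hash match is verified character by character to rule out spurious collisions.

```python
def rabin_karp(text, pattern, base=256, mod=(1 << 61) - 1):
    """Return the indices of all occurrences of pattern in text.

    mod is a word-sized prime, so each rolling-hash update below is O(1)
    word operations; hashing the pattern is O(m), not O(k * m).
    """
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return []

    # Hash the pattern and the first window: O(m) each.
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    high = pow(base, m - 1, mod)   # weight of the window's leading character

    matches = []
    for i in range(n - m + 1):
        # On a hash match, verify the characters directly; this is what
        # handles the ~1/mod-probability spurious collisions.
        if w_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Slide the window one position: an O(1) rolling update.
            w_hash = ((w_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))   # [0, 7]
```

(Python integers are arbitrary precision anyway, but keeping values below 2^61 - 1 mirrors the machine-word argument above.)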