Search code examples
algorithmsubstringknuth-morris-pratt

Knuth-Morris-Pratt algorithm corner-case


In the Knuth-Morris-Pratt algorithm when the "substring" word is a sequence of the same letter, eg. "AAAAAAAA...", the failure table it something like this: "-1, 0, 1, 2, 3, 4, 5,...".

This means if the test is something like "AAAAAAAB" When we will reach "B" we will go back X characters and start trying to match again, although we know we should start after B.

Am I missing something?

Edit (making the example specific):

Let's say the test is: AAAAAAAABAAAAAAAAA, that is (8 As, B, 9 As) and the word looking for is AAAAAAAAA, that is (9 As).

The fail table will be: -1, 0, 1, 2, 3, 4, 5, 6, 7.

At some point it will be m = 0, i = 8. This will fail, so m will become m = m + i - T[8] = 0 + 8 - 7 => m = 1 and i will be T[8] = 7.

This will fail again, so now we will have m = 2, i = 6, and then m = 3, i = 7, etc..


Solution

  • You will go back length(needle) characters, but you will only start matching at the offset given by the failure table. In this case, if there are 7 A's and you fail on the B, T[7] will say "skip 7 characters", so you check needle[7] vs. haystack[failed-length(needle)+7] (where failed is the index of haystack where the B in the needle compared to an A). Hence it'll run in linear time, always skipping comparison for the 6 of the 7 A's that you already matched. A smarter algorithm could probably skip ahead a little more, but only constants worth of more, as it can't be better than linear.