In the Knuth-Morris-Pratt algorithm when the "substring" word is a sequence of the same letter, eg. "AAAAAAAA...", the failure table it something like this: "-1, 0, 1, 2, 3, 4, 5,...".
This means if the test is something like "AAAAAAAB" When we will reach "B" we will go back X characters and start trying to match again, although we know we should start after B.
Am I missing something?
Edit (making the example specific):
Let's say the test is: AAAAAAAABAAAAAAAAA, that is (8 As, B, 9 As) and the word looking for is AAAAAAAAA, that is (9 As).
The fail table will be: -1, 0, 1, 2, 3, 4, 5, 6, 7.
At some point it will be m = 0, i = 8. This will fail, so m will become m = m + i - T[8] = 0 + 8 - 7 => m = 1 and i will be T[8] = 7.
This will fail again, so now we will have m = 2, i = 6, and then m = 3, i = 7, etc..
You will go back length(needle)
characters, but you will only start matching at the offset given by the failure table. In this case, if there are 7 A's and you fail on the B, T[7]
will say "skip 7 characters", so you check needle[7]
vs. haystack[failed-length(needle)+7]
(where failed is the index of haystack where the B in the needle compared to an A). Hence it'll run in linear time, always skipping comparison for the 6 of the 7 A's that you already matched. A smarter algorithm could probably skip ahead a little more, but only constants worth of more, as it can't be better than linear.