Given a pattern abababc
, the prefix table is [0,0,1,2,3,4,0]
. However, at ababab
, both abab
and ab
are prefixes. Why do we only consider abab
as a valid prefix?
+---+----------+-------+--------+
| i | P[i] | 𝝥[i] | Prefix |
+---+----------+-------+--------+
| 0 | a | 0 | |
| 1 | ab | 0 | |
| 2 | aba | 1 | a |
| 3 | abab | 2 | ab |
| 4 | ababa | 3 | aba |
| 5 | ababab | 4 | abab | (notice here ab can also be a prefix)
| 6 | abababc | 0 | |
+---+----------+-------+--------+
I cannot think of an example where the longer prefix will fail but the shorter prefix will work. Is there any proof that only longest prefix should be considered? Thanks!
Stripped of the fancy data structures, you can think of KMP as being somewhat analogous to the following (wildly inefficient) algorithm.
def find(pattern, string):
i = j = 0
while j <= len(string):
if string[i:j] == pattern:
return True
if pattern.startswith(string[i:j]):
j += 1
else:
i += 1
return False
In English, we slide a variable-length window over the string left to right, returning true if the window ever contains the pattern and false otherwise. At each step, if the window contains a prefix of the pattern, then we advance the right endpoint. Otherwise, we advance the left endpoint.
This algorithm clearly has no false positives. We prove the following three invariants by induction, which imply that it terminates and also has no false negatives.
The value of (i, j)
increases lexicographically with each iteration.
0 ≤ j - i ≤ len(pattern)
, i.e., the window is valid and never longer than the pattern.
If the pattern occurs at position i'
, then (i, j) ≤lex (i', i' + len(pattern))
, where ≤lex
means lexicographic comparison.
Invariant 1: in each iteration, either i
or j
is incremented.
Invariant 2: if the window is empty, then either the pattern is empty too (and we exit), or the window grows. If the window length equals the pattern length, then either the window contains the pattern (and we exit), or the window shrinks because it cannot contain a prefix of the pattern that is not the pattern. Otherwise, then it's fine for the window to grow or shrink by one.
Together with the exit condition of the while loop, Invariants 1 and 2 imply termination because the algorithm state transitions form a finite directed acyclic graph.
Invariant 3: if (i, j) = (i', i' + len(pattern))
, then we found the pattern (and exit), so assume (i, j) <lex (i', i' + len(pattern))
. Since (i, j + 1)
is the successor of (i, j)
in lexicographic order, we know (i, j + 1) ≤lex (i', i' + len(pattern))
, so that case is fine. Alternatively, if the current window is not a prefix of the pattern, then we infer that i < i'
, since there's no way that the current window extends to contain the pattern. Therefore, we conclude that (i + 1, j) <lex (i', i' + len(pattern))
, since i + 1 ≤ i'
and j ≤ i + len(pattern) < i' + len(pattern)
.
Tying this back to KMP, the way that the prefix table is used is to increment i
until string[i:j]
is a prefix of the pattern again. Using the longest prefix means that we don't skip values of i
. Skipping could break Invariant 3.