Search code examples
algorithmoptimizationknuth-morris-pratt

Redundant instruction in Knuth-Morris-Pratt algorithm


The Knuth-Morris-Pratt algorithm aims to find the first (and possibly next) occurrences of a substring in a string. Since the substring can contain repeating parts, it uses some kind of backtracking mechanism. This is the algorithm in pseudocode:

let m ← 0, i ← 0
while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then
                return m
            let i ← i + 1
        else
            if T[i] > -1 then
                let m ← m + i - T[i], i ← T[i]
            else
                let i ← 0, m ← m + 1

(from Wikipedia). With W the substring and S the string to search, both zero-based arrays.

I have a question about the last if clause in the algorithm: if T[i] > -1 then, based on the T-vector construction algorithm, it seems only possible that T[i] is less than zero for index i = 0. In that case one can perform a faster "check" on the index (array access is an additional instruction, especially if one takes possible cache-faults into account) as is the assignment i ← 0.

The construction of T is done by the following algorithm:

let pos ← 2, cmd ← 0
let T[0] ← -1, T[1] ← 0
while pos < length(W) do
    if W[pos-1] = W[cnd] then
        let cnd ← cnd + 1, T[pos] ← cnd, pos ← pos + 1
    else if cnd > 0 then    // (*)
        let cnd ← T[cnd]
    else
        let T[pos] ← 0, pos ← pos + 1

(from Wikipedia).

Now as one can see the algorithm only writes 0 or the value of cnd to T. For the first type of assignment, the statement is trivially true. For the second case, it depends on the value of cmd.

Now the only way cmd can be reduced is by the second case (*), in that case, cmd will become smaller and smaller until its value is zero or less. But since cmd takes values from the already initialized part of the array this can either be 0 or -1. In case cmd is indeed -1, this results in T[pos] being set to 0 because there is an increment before setting the value. In case cmd is zero, there is no problem at all.

A bit more efficient algorithm would thus be:

let m ← 0, i ← 0
while m + i < length(S) do
    if W[i] = S[m + i] then
        if i = length(W) - 1 then
            return m
        let i ← i + 1
    else
        if i > 0 then
            let m ← m + i - T[i], i ← T[i]
        else
            let m ← m + 1

Is this statement correct? If not, can you give a substring where two or more -1s appear in the T-array?


Solution

  • That looks fine to me, although I don't know how much difference it will make in practice. It's certainly true that in common scenarios, the majority of the loops will be precisely the case where i is 0 and the character at position S[m]W[0].

    I don't think the algorithm in Wikipedia is "official" or hyperoptimized; it's intended to be didactic.

    The second branch of the if occurs when a character is encountered which cannot extend any candidate match, and is not the first character of the word being searched for; in this case, it's necessary to move over that character. (That's the common case alluded to earlier.)

    In all other cases when the failure branch is entered, m+i is unaltered. In the success case and in the final failure case, m+i is incremented by exactly one.

    Since min and max are branch-free opcodes on many CPUs, another optimization would be to set T[0] to 0 instead of -1, and change the loop to:

    let m ← 0, i ← 0
    while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then
                return m
            let i ← i + 1
        else
            let m ← m + max(1, i - T[i]), i ← T[i]
    

    But a better optimization would be to use three different loops: one for the common case (i = 0 and S[m] doesn't match W[0]); one for the case where the characters match; and one for the failure case. (The failure case doesn't need to compare m + i against the input length; it only needs to check whether i is 0.)

    For reference, the original paper (available on citeseer) presents the following simple algorithm:

    (* Note: here, m is the length of pattern and n is the length of the input *)
    j := k := 1;
    while j ≤ m and k ≤ n do
        begin
            while j > 0 and text[k] ≠ pattern[j]
                do j := next[j];
            k := k + l; j := j + l;
        end;
    

    However, the authors complain that the above simple algorithm is unnecessarily inefficient, and devote several pages to exploring optimizations.

    See Fast Matching in Strings, 1974, Knuth, Morris & Pratt