Search code examples
casetime-complexityknuth-morris-pratt

Why is the normal case complexity of KMP O(N)?


after finally having understand the KMP-algorithm, its complexity remains unclear. For the length of the pattern M and for the length of the text N, I know, in worst case it's O(N+M) (still can't explain, why), but why is it O(N) in the average case? Thank you in advance for help.


Solution

  • In this answer I am going to address this part of your question.

    For the length of the pattern M and for the length of the text N, I know, in worst case it's O(N+M) (still can't explain, why)

    I will try to explain why computing the failure function(the pre-processing step in KMP algorithm) takes linear time, and you can use similar analysis for the running time of the string matching part of the algorithm.

    Code(taken from here):

    KNUTH-MORRIS-PRATT FAILURE (P)

    Input : Pattern with N characters
    Output: Failure function f for pattern P
    
    i ← 1
    matched ← 0
    f(0) ← 0
    while i < N do
     if P[matched] = P[i]
         f(i) ← j +1
         i ← i +1
         matched← matched + 1
     else if matched>0
         matched ← f(matched - 1)
     else
         f(i) ← 0
         i ← i +1
    

    Running Time of computing failure function = max number of iterations of the while loop.

    Number of iterations of while loop = Number of successful matches + Number of unsuccessful matches

    Let's call this statement (1)

    Let N = length of pattern.

    Number of successful matches<=N
    (Check for instance, this pattern "aaaa")

    Let's call this statement (2)

    Number of unsuccessful matches is a little tricky.

    Let's take an example pattern "aaabcaaabce"

    For this pattern the array is:

    | 0 | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 0 |

    Notice that in index 3, when mismatch occurs, when we go down to value 0, we can do at most 2 mismatches, or in other words, we can only do as many mismatches as the number of characters matches so far.

    So every time mismatch occurs at index i, the maximum number of mismatches at that index i can be at most the number of characters matched so far(at index i-1).

    Intuitively the number of characters matched over the complete array is just the entries in the array which are non-zero. Because a non-zero entry means a match. Since,

    Maximum number of unsuccessful matches = number of characters matched

    => Maximum number of unsuccessful matches over the entire array

    = maximum number of character matches over the array

    = number of entries in array which are non-zero

    <= N

    Thus, Maximum number of unsuccessful matches over the array <=N

    Let's call this statement (3)

    Putting (1), (2) and (3) together:

    Total running time = running time for successful matches + running time for unsuccessful matches

    Running time <= N+N

    Running time <= 2*N

    Running time for computing failure function = O(N)