Search code examples
calgorithmsubstringknuth-morris-prattrabin-karp

O(n) substring algorithm


so i've been researching about substring searching algorithms and found out that most algorithms like the kmp and the rabin-karp algorithm need an extra amount of time complexity for preprocessing time before doing some string matching. is there any benefit in doing so? and why wouldn't they simply skip to the string matching instantly so that the big-O time complexity does not drop to O(m+n)? I tried creating a substring algorithm that i believe is O(n) (please correct me if i'm wrong), by simply skipping the preprocessing time. And i'm wondering why do people don't do it this way instead, please refer to the C code below.

int search(char hay[], char needle[], int hayLen, int needleLen){
    int found;
    int i = 0;

    while (i < (hayLen - needleLen + 1)){
        if (hay[i] == needle[0]){
            found = 1;
            for (int j=0; j<needleLen; j++){
                if (hay[i] != needle[j]){
                    found = 0;
                    break;
                }
                i++;
            }
            if (found)
                return i - needleLen;
        }
        else
            i++;
    }
    return -1;
}

edit:

removed the strlen function to avoid any unwanted time complexities


Solution

  • Honestly not a terrible question. I think most of us have tried making a solution like this when trying to make a string-finding algorithm before discovering KMP. The answer is that this greedy algorithm doesn’t work — it never goes backwards in i. You may think “aha! this is the start of the needle!” and progress forwards until discovering “uh-oh! this isn’t the whole needle!”. In this algorithm, we then progress only forwards, continuing to search for the start of the needle. However, the start of the actual needle may have been what you thought was a middle character while trying to greedily match as much of the needle as possible.

    For example, aab and aaab. It’s not until the third a that you realize “uh-oh, this isn’t the needle after all”, and a thorough O(nm) algorithm then starts again from the second position, but your algorithm just marches forward, and never realizes the aab that starts on the second position. KMP solves this by kind of noting which parts of the needle in the middle could also be potential starting points for the needle.