Search code examples
pythonalgorithmbig-o

What is the average and worst-case time complexity of my string searching algorithm?


I created this algorithm to return the index of the first occurrence of a needle in a haystack, or -1 if the needle is not part of the haystack. It passed on LeetCode with a lower-bound of one and an upper-bound of ten to the power of four for both needle and haystack and people were saying that O(n * m) solutions weren't passing at all, so I assume the average time complexity is somewhat better than that; however, I can't tell for sure as the main while loop's time complexity is tough to understand.

The code is:

def stringSearch(haystack, needle):
    s = []
    
    if len(needle) == 1:
        for i,n in enumerate(haystack):
            if n == needle:
                return i
        return -1

    for i,n in enumerate(haystack):
        if n == needle[0] and (len(haystack) - i) >= len(needle):
            s.append(i)
    
    new = []
    for i in s:
        if i + 1 < len(haystack) and haystack[i + 1] == needle[1]:
            new.append((i, i + 1, 2))
    if new and len(needle) == 2:
        return new[0][0]

    while 1:
        temp = []
        for first, start, check in new:
            if haystack[start + 1] == needle[check]:
                temp.append((first, start + 1, check + 1))
                if check + 1 >= len(needle):
                    return first
        if not temp:
            return -1
        new = temp
stringSearch("", "")

The worst-case is evidently cases where haystack is a giant repeating pattern, and needle follows that pattern in such cases as "hahahahahahahah" and "haha" or "aaaaaaaaaaaaaaaaaaaaa" and "aa".


Solution

  • Your algorithm essentially performs a breadth-first search, where you extend all partial matches you found in a previous iteration by one character and keep those that still match. So for a worst case scenario we would want to have as many partial matches as possible for as many iterations as possible.

    One such worst case input could be a haystack with an even size, with 𝑛 times the letter "a", and a needle with 𝑚=𝑛/2 characters that are all "a", except the last letter, which would be a "b". So for instance, if 𝑛 is 10:

    haystack = "aaaaaaaaaa"
    needle   = "aaaab"
    

    The while 1: loop will make 𝑛/2−2 iterations (the "minus-2" is the consequence of already having matches with 2 letters before that loop starts).

    The inner for first, start, check in new loop will make 𝑛/2+1 iterations

    Taking that together, the inner if statement will be executed (𝑛/2-2)(𝑛/2+1) times, i.e. 𝑛²/4−𝑛/2−2 times, which is O(𝑛²), and is therefore the time complexity of your algorithm.

    people were saying that O(n * m) solutions weren't passing at all

    You have given proof that their claim is false. Here it is important to realise that time complexity says absolutely nothing about how much time an algorithm will need to process actual (finite!) inputs. Time complexity is about asymptotic behaviour.