Tags: python, python-3.x, string-comparison

String comparison in constant time


I was thinking about a faster way to compare two strings. Checking whether a value exists in a Python set (a hash table) takes constant time. Does that mean that looking up a string in a set is also constant time?

print('tya' == 'tya')  # O(n)
mySet = set()
mySet.add('tya')
if 'tya' in mySet:  # O(1) <-- ???
    print('True')

More generally, does that mean I can find a substring in a string in linear time?

def NaiveFind(largeString, subString):
    mySet = set()
    mySet.add(subString)
    start = 0
    end = len(subString)

    # <= so the final window is also checked
    while end <= len(largeString):  # O(n-m) iterations
        windowFromLarge = largeString[start:end]
        if windowFromLarge in mySet:  # O(1) <------- LINEAR ???
        # if windowFromLarge == subString:  # O(m)
            return start
        start += 1
        end += 1

    return -1


Solution

  • You say

    Checking if value exists in Python set (hash table) has constant time.

    but that's a common oversimplification, made either because people don't realize they're making it or because spelling out the actual behavior every time would take too long.

    Checking if a value exists in a Python set requires an average-case constant number of hashing operations and equality comparisons, assuming hash collisions don't get out of hand. It does not automatically make hashing operations and equality comparisons constant time.
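    You can see both operations happen by instrumenting a string. `CountingStr` below is a hypothetical helper written just for this illustration; it counts Python-level calls to `__hash__` and `__eq__`:

    ```python
    # Hypothetical helper: a str subclass that counts how often Python
    # hashes it or compares it for equality.
    class CountingStr(str):
        hash_calls = 0
        eq_calls = 0

        def __hash__(self):
            CountingStr.hash_calls += 1
            return super().__hash__()

        def __eq__(self, other):
            CountingStr.eq_calls += 1
            return super().__eq__(other)

    mySet = set()
    mySet.add(CountingStr('tya'))   # one hash to insert the element

    probe = CountingStr('tya')      # a distinct object with the same value
    probe in mySet                  # hashes the probe, then compares for equality

    print(CountingStr.hash_calls, CountingStr.eq_calls)
    ```

    The membership test is "O(1)" only in the number of hash/equality operations; each hash still has to read all m characters of the string, and so does the equality check on a bucket hit.
    
    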

    Your NaiveFind algorithm is not linear time, because you've neglected the cost of hash computation (and also because string slicing requires a copy in CPython). The Rabin-Karp algorithm uses a refined version of your idea, where the hash is a rolling hash to avoid this problem. The Rabin-Karp algorithm is average-case linear time, as long as hash collisions don't get out of hand. There are also algorithms like Knuth-Morris-Pratt that are guaranteed linear time, and algorithms like Boyer-Moore that can do better than linear time in the usual case.
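    A minimal Rabin-Karp sketch shows the rolling-hash idea: updating the window hash from the previous one costs O(1), so you never re-hash the whole window. The `base` and `mod` values here are illustrative choices, not tuned ones:

    ```python
    def rabin_karp(text, pattern, base=256, mod=1_000_003):
        """Average-case O(n + m) substring search using a rolling hash.
        A sketch for illustration; base/mod are arbitrary choices."""
        n, m = len(text), len(pattern)
        if m == 0:
            return 0
        if m > n:
            return -1

        # base**(m-1) % mod, used to remove the window's leading character.
        high = pow(base, m - 1, mod)

        # Hash the pattern and the first window, O(m) once each.
        p_hash = w_hash = 0
        for i in range(m):
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            w_hash = (w_hash * base + ord(text[i])) % mod

        for start in range(n - m + 1):
            # A hash match is only a candidate: verify to rule out collisions.
            if w_hash == p_hash and text[start:start + m] == pattern:
                return start
            if start + m < n:
                # Slide the window in O(1): drop text[start], add text[start + m].
                w_hash = ((w_hash - ord(text[start]) * high) * base
                          + ord(text[start + m])) % mod
        return -1
    ```

    The explicit verification step (`text[start:start + m] == pattern`) is what keeps the algorithm correct when two different windows collide on the same hash; it is what makes the linear bound average-case rather than worst-case.
    
    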