Tags: python, string, performance, case-insensitive, startswith

Case-insensitive string startswith in Python


Here is how I check whether mystring begins with some string:

>>> mystring.lower().startswith("he")
True

The problem is that mystring is very long (thousands of characters), so the lower() operation takes a lot of time.

QUESTION: Is there a more efficient way?

My unsuccessful attempt:

>>> import re;
>>> mystring.startswith("he", re.I)
False
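
A note on why this attempt returns False rather than raising an error: the second positional argument of str.startswith is a start index, not a flags value, and re.I has the integer value 2. The call therefore checks whether the string starting at index 2 begins with "he". A minimal sketch, where the sample string is an assumption and not the asker's data:

```python
import re

mystring = "Hello, world"  # hypothetical sample; the asker's real string is not shown

# re.I is an integer-valued flag, so startswith() accepts it as a start index.
flag_value = int(re.I)

# These two calls are equivalent: both test the text from index 2 onward ("llo, world").
with_flag = mystring.startswith("he", re.I)
with_index = mystring[flag_value:].startswith("he")

print(flag_value, with_flag, with_index)
```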

Solution

  • You could use a regular expression as follows (re.match only matches at the beginning of the string):

    In [33]: bool(re.match('he', 'Hello', re.I))
    Out[33]: True 
    
    In [34]: bool(re.match('el', 'Hello', re.I))
    Out[34]: False 
    

    On a 2000-character string this is about 20 times faster than lower():

    In [38]: s = 'A' * 2000
    
    In [39]: %timeit s.lower().startswith('he')
    10000 loops, best of 3: 41.3 us per loop
    
    In [40]: %timeit bool(re.match('el', s, re.I))
    100000 loops, best of 3: 2.06 us per loop
    

    If you are matching the same prefix repeatedly, pre-compiling the regex can make a large difference:

    In [41]: p = re.compile('he', re.I)
    
    In [42]: %timeit p.match(s)
    1000000 loops, best of 3: 351 ns per loop
    

    For short prefixes, slicing off the first few characters and lowercasing only that slice could be even faster:

    In [43]: %timeit s[:2].lower() == 'he'
    1000000 loops, best of 3: 287 ns per loop
    

    Relative timings of these approaches will of course depend on the length of the prefix. On my machine the break-even point seems to be about six characters; from there on, the pre-compiled regex becomes the fastest method.

    In my experiments, checking every character separately could be even faster:

    In [44]: %timeit (s[0] == 'h' or s[0] == 'H') and (s[1] == 'e' or s[1] == 'E')
    1000000 loops, best of 3: 189 ns per loop
    

    However, this method only works for prefixes that are known when you're writing the code, and doesn't lend itself to longer prefixes.
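
For prefixes that are only known at run time, the regex and slicing approaches above can be wrapped in small helpers. This is a sketch rather than a definitive implementation: the names make_prefix_matcher and startswith_ci are hypothetical, and the use of re.escape is an addition so that prefixes containing regex metacharacters such as "." are matched literally.

```python
import re


def make_prefix_matcher(prefix):
    """Return a function that tests strings for `prefix`, ignoring case.

    The pattern is compiled once, so the cost is amortized when the same
    prefix is checked against many strings.
    """
    # re.escape makes characters like "." or "(" match literally.
    pattern = re.compile(re.escape(prefix), re.IGNORECASE)
    return lambda s: pattern.match(s) is not None


def startswith_ci(s, prefix):
    """Lowercase only the leading slice of `s` instead of the whole string."""
    return s[:len(prefix)].lower() == prefix.lower()


starts_with_he = make_prefix_matcher("he")
print(starts_with_he("Hello"), startswith_ci("Hello", "he"))
```

Both helpers assume mostly ASCII text. For some non-ASCII characters, lower() changes the length of the string, so the slice comparison and the regex may disagree on such input.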