Search code examples
pythonstringpython-3.xunicodecase-folding

How should I remove one string from the start of another given that I know the longer string matches case-insensitively?


Suppose I have a workflow that involves examining the start of a long string (LS, say) to see whether it begins with a shorter string SS. If it does, I chop off the matching part of LS and do something with the remaining part. Otherwise, I do something else. (The specific case that prompted this question was a parsing library.)

def do_thing(LS, SS):
    if (LS.startswith(SS)):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()

This is straightforward. Now, though, suppose that I want to do the same thing but this time I want the strings to be matched case-insensitively. It is possible to test whether "LS.startswith(SS) but case-insensitively". But how should I determine how much of LS to "chop off" when I pass it in to action_on_match()? It isn't sufficient to just use len(SS) as it was before, because if I'm uppercasing or lowercasing or casefolding things, then the length of the matching prefix of LS might not be what I expect: changing the case of a string can change its length. It is important that the part of LS passed to action_on_match() be exactly what the program received as input (after the cutoff point, of course).


Answerers have suggested using lower() and preserving the use of len(SS), but this will not work:

Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:15:05) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def action_on_match (s): return "Match: %s" % s
...
>>> def action_on_no_match (): return "No match"
...
>>> def do_thing (LS, SS):
...     if LS.lower().startswith(SS.lower()):
...         return action_on_match(LS[len(SS):])
...     else:
...         return action_on_no_match()
...
>>> do_thing('i\u0307asdf', '\u0130')
'Match: \u0307asdf'
>>>

Here we expect to see 'Match: asdf', but there is an extra character.


Solution

  • Simple enough:

    def do_thing(LS, SS):
        if LS.lower().startswith(SS.lower()):
            action_on_match(LS[len(SS):])
        else:
            action_on_no_match()
    

    All I'm doing is lower-casing both LS and SS and then comparing them. This will be much slower than a regex solution for very long strings, as it has to convert the entire string to lowercase first.

    A regex solution would look like this:

    import re
    
    def do_thing(LS, SS):
        if re.match("^%s" % SS, LS, re.I):
            action_on_match(LS[len(SS):])
        else:
            action_on_no_match()
    

    Performance

    For short strings (len(LL) == 8 characters) over 1000000 iterations:

    • lower() method: 0.86s (winner)
    • re method: 1.91s

    For long strings (len(LL) == 600 characters) over 1000000 iterations:

    • lower() method: 2.54s
    • re method: 1.96s (winner)

    Unicode combining characters

    For unicode combining characters, the data needs to be normalised first. This means converting any precomposed character into its component parts. You will find for example:

    >>> '\u0130' == 'I\u0307'
    False
    >>> normalize("NFD", '\u0130') == normalize("NFD", 'I\u0307')
    True
    

    You will need to perform this normalisation process on your inputs:

    SS = normalize("NFD", SS)
    LS = normalize("NFD", LS)