python, string, unicode-normalization

Strings in Python not equal due to German Umlaut


I tried to compare two strings that both contain the German umlaut "ü". They look exactly the same, and there is no trailing \n or anything similar.


One of the strings is read from an XML file, the other from the file system. Comparing them character by character shows a difference at the umlaut.


The distorted umlaut (consisting of two characters, a plain u followed by the two dots as a separate combining character) comes from the file system. I'm using macOS High Sierra and Python 3.7. The file name is read using os.listdir().
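The difference described above can be reproduced without any files. The following is a minimal sketch, assuming the two strings differ only in how the "ü" is encoded; the variable names and the `describe` helper are made up for illustration and are not part of the original post.

```python
import unicodedata

composed = "\u00fc"      # 'ü' stored as a single precomposed code point
decomposed = "u\u0308"   # plain 'u' followed by a combining diaeresis

def describe(s):
    """Return the Unicode name of each code point in s (illustrative helper)."""
    return [unicodedata.name(ch) for ch in s]

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2
print(describe(composed))               # ['LATIN SMALL LETTER U WITH DIAERESIS']
print(describe(decomposed))             # ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']
```

Both strings render as the same glyph, which is why they look identical on screen while comparing unequal.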

I'd appreciate suggestions to handle this strange behavior (getting rid of the "ü" is not an option).


Solution

  • Instead of comparing the strings directly, compare the results of unicodedata.normalize() for each of them, using the same form argument (for example 'NFC' or 'NFD') for both.

    From the documentation: Comparing strings

    A second tool is the unicodedata module’s normalize() function that converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won’t falsely report inequality if two strings use combining characters differently.

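    import unicodedata

    def compare_strs(s1, s2):
        """Return True if s1 and s2 are equal after Unicode normalization."""
        def nfd(s):
            # Decompose precomposed characters such as 'ü' into base letter + combining mark
            return unicodedata.normalize('NFD', s)

        # Both strings must be normalized to the same form before comparing
        return nfd(s1) == nfd(s2)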
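Applied to the situation in the question, the same idea might look like the sketch below. It assumes the name taken from the XML file is precomposed and the directory listing is decomposed; `find_entry`, `xml_name`, and `listing` are hypothetical names, and the hard-coded list stands in for the result of os.listdir().

```python
import unicodedata

def nfc(s):
    """Normalize s to the composed form (NFC)."""
    return unicodedata.normalize("NFC", s)

def find_entry(wanted, entries):
    """Return the first entry equal to `wanted` after NFC normalization, else None."""
    target = nfc(wanted)
    for entry in entries:
        if nfc(entry) == target:
            # Return the entry unchanged so it can be passed back to the file system
            return entry
    return None

xml_name = "M\u00fcller.txt"                   # composed 'ü', as it might appear in the XML
listing = ["readme.txt", "Mu\u0308ller.txt"]   # stand-in for os.listdir(path)

match = find_entry(xml_name, listing)
print(match is not None)   # True
print(xml_name == match)   # False: the raw strings still differ
```

Returning the original entry rather than the normalized one is a deliberate choice here: the comparison uses normalized copies, while the string handed back is exactly what the file system reported.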