I tried to compare two strings that both contain the German umlaut "ü". Both look literally the same, and there is no trailing \n or anything similar.
One of the strings is read from an XML file, the other from the filesystem. Comparing them character by character shows a difference at the umlaut.
The odd umlaut (made up of two characters, a plain u followed by the two dots) comes from the file system. I'm using macOS High Sierra and Python 3.7. The filename is read using os.listdir().
I'd appreciate suggestions to handle this strange behavior (getting rid of the "ü" is not an option).
Instead of comparing the strings directly, compare their unicodedata.normalize() results, using the same form parameter for both.
From the documentation: Comparing strings
A second tool is the unicodedata module’s normalize() function that converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won’t falsely report inequality if two strings use combining characters differently
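To make the difference visible, here is a minimal sketch that builds both variants of "ü" by hand. The two hard-coded strings and the helper name describe() are my own illustration, standing in for what the XML file and os.listdir() presumably return in your case.

```python
import unicodedata

# Hard-coded stand-ins (assumption): the XML value is composed,
# the file system value is decomposed.
composed = "\u00fc"      # one code point: u with diaeresis
decomposed = "u\u0308"   # plain "u" followed by a combining diaeresis


def describe(s):
    """Return the Unicode name of every code point in s (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in s]


print(composed == decomposed)   # False: the code point sequences differ
print(describe(composed))       # ['LATIN SMALL LETTER U WITH DIAERESIS']
print(describe(decomposed))     # ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']

# Normalizing to a common form makes the two strings equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

NFC composes and NFD decomposes. Either works for a comparison, as long as both strings go through the same form.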
import unicodedata

def compare_strs(s1, s2):
    # Bring both strings into the same normal form before comparing them
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(s1) == NFD(s2)
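Applied to your case, the same idea could look roughly like the sketch below. The function names and the example file names are made up for illustration, and I use NFC here only because it is the form most text sources produce. Any form works if it is applied to both sides.

```python
import os
import unicodedata


def nfc(s):
    """Return s in NFC (composed) form."""
    return unicodedata.normalize("NFC", s)


def matching_names(names, wanted):
    """Return the entries of names that equal wanted after normalization."""
    target = nfc(wanted)
    return [name for name in names if nfc(name) == target]


def find_in_directory(directory, wanted):
    """Apply matching_names to the entries returned by os.listdir()."""
    return matching_names(os.listdir(directory), wanted)


# Hard-coded names instead of a real directory (assumed values):
from_filesystem = ["u\u0308ber.txt", "other.txt"]   # decomposed, as in the question
from_xml = "\u00fcber.txt"                          # composed
print(matching_names(from_filesystem, from_xml) == ["u\u0308ber.txt"])   # True
```

The function returns the names exactly as the file system reported them, so they can be passed straight back to open() or os.path.join().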