Tags: python, python-unicode, unicode-normalization

Different encoding when using unicode character in Python


I'm running into a problem in Python when a string contains a decomposed Unicode sequence (a base letter plus a combining mark) instead of the precomposed character. Here is code to reproduce it:

# encoding=utf8

a = ["Địa"]
b = ["Địa"]

print(a)  # ['\xc4\x90i\xcc\xa3a']
print(b)  # ['\xc4\x90\xe1\xbb\x8ba']

print("Địa" in a)  # False
print("Địa" in b)  # True

How can I convert/normalize them to the same form?


Solution

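  • You can use unicodedata.normalize() to bring both strings to the same normalization form. NFC composes "i" followed by the combining dot below (U+0323) into the single character "ị" (U+1ECB), so both lists then hold the same code points: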

    # encoding=utf8
    import unicodedata
    a = ["Địa"]
    b = ["Địa"]
    
    print("Địa" in [unicodedata.normalize('NFC', i) for i in a])
    print("Địa" in [unicodedata.normalize('NFC', i) for i in b])
    

    This outputs:

    True
    True
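
To see why the two lists differ before normalization, it can help to spell the strings out with escape sequences and list their code points. The following is a minimal sketch, assuming Python 3; the helper names describe and same_text are made up for illustration and are not part of unicodedata.

```python
import unicodedata

# Escapes make the code points unambiguous regardless of the editor.
decomposed = "\u0110i\u0323a"   # Đ, i, COMBINING DOT BELOW, a
precomposed = "\u0110\u1ecba"   # Đ, LATIN SMALL LETTER I WITH DOT BELOW, a


def describe(text):
    """Return the Unicode name of each code point in text (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in text]


def same_text(left, right, form="NFC"):
    """Compare two strings after normalizing both to the same form (hypothetical helper)."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


print(describe(decomposed))
print(describe(precomposed))
print(decomposed == precomposed)           # False: different code points
print(same_text(decomposed, precomposed))  # True: equal after NFC
```

Normalizing both sides at the point of comparison, rather than only the list, avoids depending on which form the search string happens to be typed in.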