I'm having a problem in Python when I encounter decomposed Unicode (a base letter followed by a combining mark) instead of precomposed Unicode. Here is code to reproduce it:
# encoding=utf8
a = ["Địa"]
b = ["Địa"]
print(a) # ['\xc4\x90i\xcc\xa3a']
print(b) # ['\xc4\x90\xe1\xbb\x8ba']
print("Địa" in a) # False
print("Địa" in b) # True
How can I convert/normalize them into the same form?
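You can use unicodedata.normalize() to bring both strings to the same normalization form (NFC combines a base letter and its combining marks into a single precomposed character where one exists):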
# encoding=utf8
import unicodedata
a = ["Địa"]
b = ["Địa"]
print("Địa" in [unicodedata.normalize('NFC', i) for i in a])
print("Địa" in [unicodedata.normalize('NFC', i) for i in b])
This outputs:
True
True
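To make the difference between the two strings visible, here is a minimal sketch that spells out both forms with explicit escape sequences instead of literal characters. It assumes Python 3, where str is already Unicode; the byte output in the question suggests Python 2, where the byte strings would first need decoding (for example with .decode('utf-8')) before unicodedata.normalize() accepts them. The names decomposed, composed, and nfc are made up for this illustration.

```python
import unicodedata

# Decomposed form: "i" followed by U+0323 COMBINING DOT BELOW.
decomposed = "\u0110i\u0323a"
# Precomposed form: U+1ECB LATIN SMALL LETTER I WITH DOT BELOW.
composed = "\u0110\u1ecba"


def nfc(text):
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# The two strings render the same but differ code point by code point.
print(decomposed == composed)            # False
print(len(decomposed), len(composed))    # 4 3

# After normalization both compare equal.
print(nfc(decomposed) == nfc(composed))  # True
```

Normalizing both sides of a comparison (the search term as well as the list items) is the safer habit, since you usually cannot know in advance which form an input arrives in.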