I have a Unicode string encoded, say, as UTF8
. One string in Unicode can have few byte representations. I wonder, is there any or can be created any canonical (normalized) form of Unicode string -- so we can e.g. compare such strings with memcmp(3)
etc. Can e.g. ICU or any other C/C++
library do that?
You might be looking for Unicode normalisation. There are essentially four different normal forms that each ensure that all equivalent strings have a common form afterwards. However, in many instances you need to take locale into account as well, so while this may be a cheap way of doing a byte-to-byte comparison (if you ensure the same Unicode transformation format, like UTF-8 or UTF-16 and the same normal form) it won't gain you much apart from that limited use case.