c++ c unicode collation unicode-normalization

Canonical Unicode string form


I have a Unicode string encoded as, say, UTF-8. The same Unicode string can have several different byte representations. Is there a canonical (normalized) form of a Unicode string, or can one be created, so that such strings can be compared with memcmp(3) and the like? Can ICU or any other C/C++ library do that?


Solution

  • You are probably looking for Unicode normalisation. Unicode defines four normal forms (NFC, NFD, NFKC and NFKD); each one maps all strings that are equivalent under its rules to a single common form. If both strings use the same Unicode transformation format (UTF-8 or UTF-16, for example) and the same normal form, a byte-for-byte comparison such as memcmp is a cheap way to test them for equality. In many cases, though, you need to take the locale into account as well, for instance when sorting or matching text for users. That is collation, not normalisation, so normalising won't gain you much beyond that limited equality check.
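
To make the point concrete, here is a minimal sketch of why a raw memcmp fails before normalisation and succeeds after it. It is not a real normaliser: `toy_compose_e_acute` and `bytes_equal` are hypothetical helpers invented for this illustration, and the first one handles exactly one pair ("e" followed by U+0301 COMBINING ACUTE ACCENT, composed into U+00E9), which is what NFC would do for that pair. A real program would use a library that covers the full Unicode data, such as ICU's Normalizer2.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper for illustration only: rewrites every UTF-8
// occurrence of U+0065 U+0301 ("e" + combining acute) as U+00E9.
// Real NFC covers the whole Unicode database; this covers one pair.
std::string toy_compose_e_acute(const std::string& utf8) {
    static const std::string decomposed = "e\xCC\x81"; // U+0065 U+0301
    static const std::string composed = "\xC3\xA9";    // U+00E9
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t hit = utf8.find(decomposed, pos);
        if (hit == std::string::npos) {
            out.append(utf8, pos, std::string::npos); // copy the tail
            break;
        }
        out.append(utf8, pos, hit - pos); // copy text before the match
        out += composed;                  // emit the precomposed form
        pos = hit + decomposed.size();
    }
    return out;
}

// Byte-for-byte equality, i.e. the memcmp-style test from the question.
bool bytes_equal(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::memcmp(a.data(), b.data(), a.size()) == 0;
}
```

With these helpers, `bytes_equal("caf\xC3\xA9", "cafe\xCC\x81")` is false even though both strings display as "café", while passing both through `toy_compose_e_acute` first makes the comparison true. The same pattern applies with a real normaliser: bring both inputs to the same normal form and the same encoding, then compare bytes.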