c++ c unicode collation unicode-normalization

Canonical Unicode string form

I have a Unicode string encoded, say, as UTF-8. A single Unicode string can have several different byte representations. Is there a canonical (normalized) form of a Unicode string, so that such strings can be compared with memcmp(3) and the like? Can ICU or any other C/C++ library produce it?

Solution

You are probably looking for Unicode normalisation. There are four normal forms (NFC, NFD, NFKC and NFKD), and each guarantees that all strings equivalent under its notion of equivalence end up in one common form. Once two strings use the same Unicode transformation format (UTF-8 or UTF-16, say) and the same normal form, a byte-for-byte comparison tells you cheaply whether they are equivalent. In many cases, however, you also need to take the locale into account, for example when sorting or matching text the way users expect, so normalisation gains you little beyond that limited use case.
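To make the problem concrete, here is a minimal sketch in plain C (no Unicode library involved) showing two UTF-8 encodings of the same visible character "é" that memcmp treats as different. The names PRECOMPOSED, DECOMPOSED and same_bytes are made up for this illustration. The sketch only demonstrates why normalisation is needed before a byte comparison; it does not perform normalisation itself. With ICU4C that step would typically go through the unorm2 API (unorm2_getNFCInstance and unorm2_normalize, which work on UTF-16 buffers), applied to both strings before comparing.

```c
#include <stddef.h>
#include <string.h>

/* "é" as the single precomposed code point U+00E9, encoded in UTF-8. */
static const char PRECOMPOSED[] = "\xC3\xA9";

/* "é" as "e" (U+0065) followed by the combining acute accent U+0301,
 * encoded in UTF-8. Canonically equivalent to PRECOMPOSED. */
static const char DECOMPOSED[] = "e\xCC\x81";

/* Hypothetical helper: returns 1 if the two NUL-terminated strings are
 * byte-for-byte identical, 0 otherwise. This is the memcmp-style
 * comparison the question asks about. */
static int same_bytes(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    return la == lb && memcmp(a, b, la) == 0;
}
```

Here same_bytes(PRECOMPOSED, DECOMPOSED) returns 0 even though both strings render as the same character. After converting both to the same normal form (NFC would turn the second into the first), the same call would return 1.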
