
Unicode normalization in strcoll


Do canonically equivalent Unicode strings collate equal? Sometimes.

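#include <clocale>
#include <cstring>
#include <iostream>

int main()
{
    // If the locale is not installed, setlocale returns a null pointer and
    // strcoll keeps using the "C" locale, which would make the test meaningless.
    if (!std::setlocale(LC_COLLATE, "en_US.UTF-8")) {
        std::cout << "Locale en_US.UTF-8 is not available" << std::endl;
        return 1;
    }
    // Precomposed U+00E9 versus "e" followed by U+0301 (combining acute accent)
    if (std::strcoll("\xc3\xa9", "e\xcc\x81"))
      std::cout << "FAIL: No Unicode normalization here" << std::endl;
    else
      std::cout << "WIN: Unicode normalization is performed" << std::endl;
}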

This program results in a WIN on my Cygwin-ized Windows machine, and FAIL on every Linux system I can get my hands on.

Is this expected behaviour? Are there Linux systems that produce a WIN? What about Mac OS X? FreeBSD?

I know I can normalize and do canonical equivalence with third-party libraries. I'm interested in standard collation rules of UTF-8 locales.

This question is inspired by this one.


Solution

  • To the best of my knowledge, Unicode normalization is not mentioned in the C, C++, or POSIX standards.

    Therefore, implementations may leave normalization as something to be done explicitly by the programmer.

    More specifically, glibc's European locales apparently use ISO 14651 as their collation algorithm. The Unicode Collation FAQ implies that ISO 14651 doesn't do normalization: uniform handling of canonical equivalents is listed as a difference between the UCA and ISO 14651.
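
To illustrate what "done explicitly by the programmer" could look like, here is a minimal sketch that normalizes both operands before handing them to strcoll. The helper names compose_e_acute and normalized_strcoll are hypothetical, and the "normalization" is a toy that handles only the single pair from the question. Real NFC normalization covers far more and would come from a Unicode library.

```cpp
#include <cstring>
#include <iostream>
#include <string>

// Hypothetical helper, for illustration only: rewrites the decomposed
// sequence "e" + U+0301 (bytes 65 CC 81) as precomposed U+00E9 (bytes C3 A9).
// This is NOT a general NFC implementation.
std::string compose_e_acute(std::string s)
{
    const std::string decomposed = "e\xcc\x81";
    const std::string composed = "\xc3\xa9";
    std::string::size_type pos = 0;
    while ((pos = s.find(decomposed, pos)) != std::string::npos) {
        s.replace(pos, decomposed.size(), composed);
        pos += composed.size();
    }
    return s;
}

// Hypothetical wrapper: applies the toy normalization step to both
// operands, then collates them with the current LC_COLLATE locale.
int normalized_strcoll(const std::string& a, const std::string& b)
{
    return std::strcoll(compose_e_acute(a).c_str(),
                        compose_e_acute(b).c_str());
}
```

With this wrapper, normalized_strcoll("\xc3\xa9", "e\xcc\x81") compares two byte-identical strings, so the result no longer depends on whether the locale's collation rules treat canonical equivalents alike.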