
How to specify the endianness of utf-16 string literals in C++17 with Clang?


UTF-16 string literals, such as auto str = u"中国字";, have been allowed in C++ source code since C++11.

UTF-16 has two endiannesses: UTF-16LE and UTF-16BE. The C++ standard doesn't specify the endianness of UTF-16 string literals. So, I think it is implementation-defined.

Is there any way to specify the endianness at compile-time?


Solution

  • A string literal prefixed with u is an array of const char16_t values:

    C++17 [lex.string]/10:

    A string-literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it is initialized with the given characters.

    So, on a Unicode system, the literal in the quote is equivalent to:

    const char16_t x[] = { 97, 115, 100, 102, 0 };
    

    In other words, the representation of the string literal is the same as the representation of that array.
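
    You can check that equivalence directly; a minimal sketch (the comparison itself is mine, not part of the original answer):

    #include <cassert>
    #include <cstring>

    int main() {
        const char16_t x[] = { 97, 115, 100, 102, 0 };
        // The literal and the hand-written array have identical object representations.
        static_assert(sizeof(u"asdf") == sizeof(x), "same size");
        assert(std::memcmp(u"asdf", x, sizeof(x)) == 0);
    }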

    For a more complicated string, it is still an array of const char16_t; and a single c-char may need more than one UTF-16 code unit (a surrogate pair), so the number of elements in the array might be greater than the number of characters that seem to appear in the string, as the sketch below shows.
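
    For instance, 𐐷 (U+10437) lies outside the Basic Multilingual Plane and is encoded as a surrogate pair, i.e. two char16_t code units. A minimal sketch demonstrating this (the casts to unsigned are just to get numeric output in C++17):

    #include <iostream>

    int main() {
        // One c-char in the source, two UTF-16 code units in the array.
        const char16_t s[] = u"\U00010437";
        std::cout << sizeof(s) / sizeof(s[0]) << '\n';  // 3: surrogate pair + NUL
        std::cout << std::hex
                  << static_cast<unsigned>(s[0]) << ' '   // d801 (high surrogate)
                  << static_cast<unsigned>(s[1]) << '\n'; // dc37 (low surrogate)
    }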


    To answer the question in the title: I'm not aware of any compiler option (for any compiler) that would let you configure the endianness of char16_t. I would expect any target system to use the same endianness for all the integral types. char16_t is supposed to have the same properties as uint_least16_t ([basic.fundamental]/5).
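
    C++17 has no standard compile-time endianness query (std::endian only arrives in C++20), but a runtime check is short; a minimal sketch:

    #include <cstring>

    // Returns true if this machine stores char16_t little-endian.
    bool char16_is_little_endian() {
        const char16_t probe = 0x0102;
        unsigned char bytes[sizeof probe];
        std::memcpy(bytes, &probe, sizeof probe);
        return bytes[0] == 0x02;  // low byte first => little-endian
    }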

    If your code contains string literals and you want to write them to a file as, specifically, UTF-16BE, you'll need to do the usual endianness checks/adjustments in case your system stores char16_t in little-endian form.
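
    A minimal sketch of such an adjustment; write_utf16be is a hypothetical helper name, and building each pair of bytes with shifts produces big-endian output regardless of the host's byte order, so no separate endianness check is needed in this direction:

    #include <cstddef>
    #include <cstdint>
    #include <fstream>

    // Hypothetical helper: write a char16_t string to a stream as UTF-16BE.
    void write_utf16be(std::ofstream& out, const char16_t* s, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            const auto u = static_cast<std::uint16_t>(s[i]);
            const unsigned char bytes[2] = {
                static_cast<unsigned char>(u >> 8),   // high byte first
                static_cast<unsigned char>(u & 0xFF), // then the low byte
            };
            out.write(reinterpret_cast<const char*>(bytes), 2);
        }
    }

    int main() {
        std::ofstream out("out.txt", std::ios::binary);
        const char16_t s[] = u"中国字";
        write_utf16be(out, s, sizeof(s) / sizeof(s[0]) - 1);  // skip the NUL
    }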