Tags: c++, unicode, language-lawyer, utf-16, string-literals

Assigning char16_t a character literal whose code point is outside the basic multilingual plane


In a talk on Unicode I watched earlier today, there was some confusion about what should happen when you try to assign a character literal whose value is too large to be represented by the char16_t type. The presenter said that, on his reading of the standard, the program ought to be ill-formed, but that gcc accepts it anyway. He didn't elaborate, and YouTube doesn't let me ask questions.

My own testing confirms that the following code is accepted by g++-4.8 and g++-4.9 (with warnings):

int main() {
  char16_t a = u'\U0001F378';  // U+1F378 (COCKTAIL GLASS) lies outside the BMP
}

http://coliru.stacked-crooked.com/a/6cb2206660407a8d
https://eval.in/188979
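
For what it's worth, the UTF-16 arithmetic shows why the value cannot fit in a single 16-bit code unit; the following sketch (my own illustration, not from the talk) computes the surrogate pair by hand:

#include <cstdio>

int main() {
  // U+1F378 exceeds 0xFFFF, so UTF-16 must encode it as a surrogate
  // pair rather than a single 16-bit code unit.
  unsigned cp   = 0x1F378 - 0x10000;      // 20 payload bits: 0x0F378
  unsigned high = 0xD800 + (cp >> 10);    // high surrogate: 0xD83C
  unsigned low  = 0xDC00 + (cp & 0x3FF);  // low surrogate:  0xDF78
  std::printf("%04X %04X\n", high, low);  // prints: D83C DF78
}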

On the other hand, Clang 3.4 rejects it with an error.

Which compiler is correct? I can't find the chapter and verse for this.

An additional small question: the character literal section, §2.14.3, does not mention the \u and \U escape sequences, either in the grammar or in the section body. Is this an oversight?


Solution

  • The program is ill-formed and should fail to compile.

    2.14.3/2 A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) *If the value is not representable within 16 bits, the program is ill-formed*...

    Emphasis mine.
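
    If you need the character itself, there are well-formed ways to spell it; a minimal sketch, assuming a C++11 compiler:

    #include <cstdio>

    int main() {
        // A char32_t literal can hold any ISO 10646 code point directly:
        char32_t c = U'\U0001F378';

        // In a char16_t *string* literal the same code point is fine;
        // the compiler encodes it as the surrogate pair 0xD83C 0xDF78:
        const char16_t* s = u"\U0001F378";

        std::printf("%08X\n", static_cast<unsigned>(c));  // 0001F378
        std::printf("%04X %04X\n", static_cast<unsigned>(s[0]),
                                   static_cast<unsigned>(s[1]));
    }

    A char16_t character literal can never hold a surrogate pair, because it denotes exactly one code unit; the string literal form can.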

    \u and \U are not escape sequences within the meaning of 2.14.3. They are universal character names, described in 2.3/2. They are not limited to character and string literals, but may appear anywhere in the program:

    int main() {
        int \u0410 = 42;   // declares a variable named А (Cyrillic Capital Letter A)
        return \u0410;     // uses it; the program returns 42
    }
    

    \u0410 is А, a.k.a. CYRILLIC CAPITAL LETTER A.
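
    So a universal character name is fine inside a char16_t character literal as long as it names a BMP code point; a small sketch tying the two rules together:

    int main() {
        char16_t ok = u'\u0410';          // U+0410 is in the BMP: well-formed
        // char16_t bad = u'\U0001F378';  // not representable in 16 bits: ill-formed
        return ok == 0x0410 ? 0 : 1;      // the literal's value is its code point
    }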