In a talk on Unicode I watched earlier today, there was some confusion about what should happen when you try to assign a character literal whose value is too large to be represented by the char16_t type. The presenter said that, based on his reading of the standard, the program ought to be ill-formed, but that gcc accepts it anyway. He didn't elaborate on the point, and YouTube doesn't let me ask questions.
My own testing confirms that the following code is accepted (with warnings) by g++ 4.8 and g++ 4.9:
int main() {
    char16_t a = u'\U0001F378';
}
http://coliru.stacked-crooked.com/a/6cb2206660407a8d
https://eval.in/188979
On the other hand, clang 3.4 generates an error.
Which compiler is correct? I can't find the chapter and verse for this.
Additional small question: the character literal section, §2.14.3, does not mention the \u and \U escape sequences in the grammar or in the body of the section. Is this an oversight?
The program is ill-formed and should fail to compile.
2.14.3/2 A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, *the program is ill-formed*...
Emphasis mine.
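To make the distinction concrete, here is a minimal illustration of my own (not taken from the talk or the quoted wording): the same form of literal is fine whenever the named code point lies in the basic multilingual plane.

int main() {
    char16_t ok  = u'\u00E9';     // U+00E9 fits in one 16-bit code unit: well-formed
    char16_t bad = u'\U0001F378'; // U+1F378 does not: ill-formed
}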
\u and \U are not escape sequences within the meaning of §2.14.3. They are universal character names, described in §2.3/2. They are not limited to character and string literals; they may appear anywhere in the program:
int main() {
    int \u0410 = 42;
    return \u0410;
}
\u0410 is А, aka Cyrillic Capital Letter A.
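As a side note, here is a rough sketch of how you could actually store that code point, assuming a C++11 compiler (the surrogate values below are ordinary UTF-16 arithmetic, not something spelled out in the quoted paragraph): a char32_t literal can hold any single code point, and in a char16_t string literal a single non-BMP c-char is permitted and produces a surrogate pair.

#include <cassert>

int main() {
    char32_t whole  = U'\U0001F378';  // a char32_t literal holds any single code point
    char16_t pair[] = u"\U0001F378";  // in a string literal this becomes a surrogate pair
    static_assert(sizeof(pair) == 3 * sizeof(char16_t), "two code units plus the terminator");
    assert(pair[0] == 0xD83C && pair[1] == 0xDF78);
    (void)whole;
}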