Tags: c++, char, language-lawyer

Is char ch = '\xe4' unspecified or implementation defined


I am learning C++ using the books listed here. In particular, I read here that:

If the value represented by a single hexadecimal escape sequence does not fit the range of values represented by the character type used in this string literal (char, char8_t (since C++20), char16_t, char32_t (since C++11), or wchar_t), the result is unspecified.

(emphasis mine)

This means that, in a system where char is signed, the result of '\xe4' will be unspecified. But here the person says that "it is implementation defined and not unspecified".

So, my question: is the behavior of the statements below unspecified or implementation-defined? That is, is this an error in cppreference's documentation, or have I misunderstood it?

char arr[] = {'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'}; //unspecified or implementation defined 
char ch = '\xef';                                              //unspecified or implementation defined

Solution

  • This can be either implementation defined (as per C++17) or (probably) well defined (as per C++23).

    In C++17 (or earlier?), according to this Draft Standard:

    5.13.3 Character literals        [lex.ccon]


    8     … The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character literals with no prefix) or wchar_t (for character literals prefixed by L). …
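    Whether '\xe4' actually "falls outside" that range depends on the implementation's char. A minimal sketch for checking this, assuming an 8-bit char; the helper name fits_in_plain_char is hypothetical and not part of any standard API:

    ```cpp
    #include <limits>

    // Hypothetical helper (illustration only): does the value written in a
    // hex escape such as '\xe4' fit the range of plain char here?
    constexpr bool fits_in_plain_char(unsigned long v) {
        // max() of char is 127 where char is signed and 8 bits wide,
        // and 255 where it is unsigned and 8 bits wide.
        return v <= static_cast<unsigned long>(std::numeric_limits<char>::max());
    }

    // True when plain char is a signed type on this implementation.
    constexpr bool plain_char_is_signed = std::numeric_limits<char>::is_signed;
    ```

    Under these assumptions, fits_in_plain_char(0xE4) is false exactly when plain_char_is_signed is true, which is the case where the C++17 wording makes the literal's value implementation-defined.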

    However, from this Draft C++23 Standard (also §5.13.3, [lex.ccon]):

    3.2.3     Otherwise, if the character-literal's encoding-prefix is absent or L, and v does not exceed the range of representable values of the corresponding unsigned type for the underlying type of the character-literal's type, then the value is the unique value of the character-literal's type T that is congruent to v modulo 2^N, where N is the width of T.

    So, in your case, as long as the value of the escape sequence is representable by an unsigned char, there is neither undefined nor implementation-defined behaviour as of C++23. However, if that value is outside the range of that unsigned equivalent, the literal is ill-formed:

    3.2.4     Otherwise, the character-literal is ill-formed.
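    The two C++23 clauses quoted above can be modelled roughly as follows. This is only a sketch of the arithmetic, not the compiler's implementation; the helper names is_well_formed and char_value are hypothetical, and it assumes C++14 or later and that long is wider than char:

    ```cpp
    #include <climits>
    #include <limits>

    // Hypothetical helper: the literal is well-formed only if v fits the
    // unsigned counterpart of char (3.2.3); otherwise it is ill-formed (3.2.4).
    constexpr bool is_well_formed(unsigned long v) {
        return v <= UCHAR_MAX;
    }

    // Hypothetical helper: the unique value of type char congruent to v
    // modulo 2^N, where N is the width of char.
    // Precondition: is_well_formed(v).
    constexpr int char_value(unsigned long v) {
        constexpr long modulus = 1L << CHAR_BIT;  // 2^N
        const long x = static_cast<long>(v);
        // Values above char's maximum wrap around by subtracting 2^N.
        return static_cast<int>(x > std::numeric_limits<char>::max() ? x - modulus : x);
    }
    ```

    With an 8-bit signed char, char_value(0xE4) is 228 - 256 = -28; with an 8-bit unsigned char it stays 228. Either way the result follows from the rule itself once the signedness of char is known.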


    Note: This C++20 Draft Standard has the same clause as the above-cited C++17 version (although it's paragraph 7, rather than 8).