
C++23: char now supports Unicode?


Does C++23 now provide support for Unicode characters in its basic char type, and to what degree?


On cppreference's page for character literals, a

c-char

is defined as one of:

  • a basic-c-char
  • an escape sequence, as defined in escape sequences
  • a universal character name, as defined in escape sequences

and basic-c-char, in turn, is defined as:

A character from the basic source character set (until C++23) translation character set (since C++23), except the single-quote ', backslash \, or new-line character

On cppreference's page for character sets, the "translation character set" is then defined as consisting of the following:

  • each abstract character assigned a code point in the Unicode codespace, and (since C++23)
  • a distinct character for each Unicode scalar value not assigned to an abstract character.

and states:

The translation character set is a superset of the basic character set and the basic literal character set (see below).

It seems to me that the "basic character set" (given on the same page) is essentially a subset of ASCII. I had also always thought of char as being ASCII (with support for ISO-8859 character sets, as described e.g. on Microsoft's page on the character types). But now, with basic-c-char changed to use the translation character set, it seems char literals support Unicode to some fuller extent.

I'm aware that the actual encoding is implementation-defined (apart from the null character having value zero and the decimal digit characters having consecutive values, it seems). But my main question is: what characters are really supported by this "translation character set"? Is it all of Unicode? I feel as though I'm reading more into this than is actually the case.
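For reference, the two encoding guarantees mentioned above can be checked portably at compile time:

```cpp
// Portable checks of the only two value guarantees the standard gives
// for the ordinary literal encoding: the null character is 0, and the
// decimal digit characters have consecutive values.
static_assert('\0' == 0, "null character has value zero");
static_assert('0' + 9 == '9', "decimal digits are contiguous");
```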


Solution

  • Effectively, not much changed, apart from two important differences:

    Before C++23, the first translation phase specified that any character in the source file that isn't an element of the basic source character set (a subset of the ASCII character set) was mapped to a universal-character-name, i.e. it was replaced by a sequence of the form \UXXXXXXXX, where XXXXXXXX is the hexadecimal number of the ISO/IEC 10646 (equivalently, Unicode) code point for the character.

    Then, when writing a character literal 'X', where X is a character not in the basic source character set, you would get '\UXXXXXXXX' after the first translation phase, and the c-char -> universal-character-name grammar production applied.

    So you could always write non-ASCII characters in a character literal, assuming the source encoding permitted writing such a character. The source file encoding and the supported source characters outside the basic source character set were implementation-defined (the source character set encoding). Regardless of the source character set, you could already write any Unicode scalar value directly into a character literal with a universal character name.
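    A universal character name and the character it designates are simply two spellings of the same c-char, so they are guaranteed to yield the same value under any ordinary literal encoding:

```cpp
// Both spellings denote U+0041 LATIN CAPITAL LETTER A, so this holds
// on every conforming implementation, whatever the encoding:
static_assert('\u0041' == 'A', "same abstract character, same value");
```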

    How such a character literal then behaves is a different question, because the encoding used to determine the value of the char from the universal-character-name (or from any character of the basic source character set) is implementation-defined as well (the execution character set encoding in C++20, or the ordinary literal encoding in C++23). Obviously, if char is 8 bits wide, it can't represent all Unicode scalar values. If the character was not representable in char, the literal's value was implementation-defined.
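    To illustrate with U+00E9 (é): its value as a plain char depends on the implementation-defined encoding (0xE9 under Latin-1, but two code units under UTF-8, so not representable in one char). The prefixed literals below have encodings fixed by the standard, which makes these checks portable:

```cpp
// UTF-32 (U prefix) and UTF-8 (u8 prefix) encodings are mandated by
// the standard, unlike the ordinary literal encoding:
static_assert(U'\u00e9' == 0xE9,
              "a UTF-32 code unit equals the code point");
static_assert(sizeof(u8"\u00e9") == 3,
              "two UTF-8 code units plus the null terminator");
```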

    The changes for C++23 are that support for UTF-8 source encoding became mandatory, implying support for all Unicode scalar values in the source file (although other encodings can of course still be supported), and that the first phase was changed: instead of rewriting everything into the basic source character set via universal character names, source characters are now mapped to a sequence of translation character set elements, which is essentially a sequence of Unicode scalar values. Unicode code points that are not Unicode scalar values, i.e. surrogate code points, are not elements of the translation character set (and cannot be produced by decoding any source file).

    Therefore, in C++23, when reaching the translation phase where the character literal's value is determined, a single Unicode scalar value in the source file matches the basic-c-char grammar, as you showed in your question.

    The value of the character literal is still determined, as before, by an implementation-defined encoding. However, in contrast to C++20, the literal is now ill-formed if the character is not representable in char via this encoding.

    So the two differences are that UTF-8 source file encoding must be supported and that a single source character (meaning a single Unicode scalar value) in the character literal that is not representable in the implementation-defined ordinary literal encoding will now cause the literal to be ill-formed instead of having an implementation-defined value.
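    A hedged sketch of the second difference, assuming the ordinary literal encoding is UTF-8 (the usual default for GCC and Clang), under which U+00E9 needs two code units and cannot fit into a single char:

```cpp
// Under a UTF-8 ordinary literal encoding (an assumption, since the
// encoding is implementation-defined), U+00E9 is not representable in
// one char:
//
//   char c = '\u00e9';  // C++20: implementation-defined value
//                       // C++23: ill-formed
//
// A representable character is well-formed in both standards:
constexpr char ok = '\u0041';  // U+0041 fits in any ordinary literal encoding
static_assert(ok == 'A', "representable, hence well-formed");
```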


    Analogously to the above, string literals (rather than character literals) haven't really changed either. The encoding is still implementation-defined, using the same ordinary literal encoding; primarily, only the internal representation in the translation phases changed. In the same way as for character literals, with C++23 the literal becomes ill-formed if a character (i.e. a translation character set element, or Unicode scalar value) is not representable in the ordinary literal encoding. However, that encoding may be e.g. UTF-8, in which case a single Unicode scalar value in the source file may map to multiple chars in the encoded string, as has always been the case.
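    The multi-char mapping can be observed directly, again under the assumption that the ordinary literal encoding is UTF-8 (the default for GCC and Clang; MSVC needs /utf-8):

```cpp
#include <cstring>

// One Unicode scalar value in the source, two chars in the encoded
// string (assuming a UTF-8 ordinary literal encoding):
const char* s = "\u00e9";              // U+00E9, é
const std::size_t n = std::strlen(s);  // 2 code units: 0xC3 0xA9
```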