Consider the following sequence of bytes in hexadecimal representation (ASCII interpretations, if any, in the second column as a reading aid):
0x73 s
0x74 t
0x61 a
0x74 t
0x69 i
0x63 c
0x5f _
0x61 a
0x73 s
0x73 s
0x65 e
0x72 r
0x74 t
0x28 (
0x55 U
0x27 '
0xe2
0x84
0xab
0x27 '
0x3d =
0x3d =
0x55 U
0x27 '
0xc3
0x85
0x27 '
0x29 )
0x3b ;
Decoded as UTF-8 this byte sequence reads
static_assert(U'Å'==U'Å');
Note that on the left side Å
is Unicode scalar value
0x212B ANGSTROM SIGN
and on the right-hand side Å
is Unicode scalar value
0x00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
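To double-check the two scalar values, here is a minimal sketch of the UTF-8 decoding step applied to the two multi-byte sequences from the listing above. The helper `decode_utf8` and the array names are hypothetical, introduced only for illustration; the helper assumes a well-formed sequence of one to three bytes and performs no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the standard library): decodes one
// well-formed UTF-8 sequence of 1 to 3 bytes. No validation is done.
constexpr char32_t decode_utf8(const std::uint8_t* p, std::size_t n) {
    if (n == 1) {
        return static_cast<char32_t>(p[0]);
    }
    if (n == 2) {
        // 110xxxxx 10xxxxxx
        return static_cast<char32_t>(((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu));
    }
    // 1110xxxx 10xxxxxx 10xxxxxx
    return static_cast<char32_t>(((p[0] & 0x0Fu) << 12) |
                                 ((p[1] & 0x3Fu) << 6) |
                                 (p[2] & 0x3Fu));
}

// The bytes between the quotes on each side of == in the listing.
constexpr std::uint8_t lhs_bytes[] = {0xE2, 0x84, 0xAB};
constexpr std::uint8_t rhs_bytes[] = {0xC3, 0x85};

static_assert(decode_utf8(lhs_bytes, 3) == 0x212B, "ANGSTROM SIGN");
static_assert(decode_utf8(rhs_bytes, 2) == 0x00C5,
              "LATIN CAPITAL LETTER A WITH RING ABOVE");
```

Both assertions hold at compile time, so after decoding the two sides are distinct scalar values; the open question is only what happens when they are mapped to translation character set elements.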
Is the assertion supposed to fail in C++23 when the byte sequence is interpreted as a source in the mandatorily-supported UTF-8 encoding?
In translation phase 1, after decoding the UTF-8 sequence to a Unicode scalar value sequence, these scalar values are supposed to be mapped to elements of the translation character set to form a sequence of translation character set elements, see [lex.phases]/1.1. According to [lex.charset]/1.1 the elements of the translation character set are, with the exception of unassigned scalar values, the abstract characters which have an assigned Unicode code point.
The closest definition I could find for "abstract character" is in the Unicode Standard. However, according to its Section 3.4, D11, an abstract character can be assigned multiple code points, and it gives the Angstrom character as an example. (EDIT: Carefully reading again, it doesn't say "assigned", just "correspond to".)
If this is the definition of abstract character meant in the C++ standard draft, isn't there then supposed to be only one element in the translation character set which is equivalent to the single abstract character represented by both the code points 0x212B and 0x00C5? If so, shouldn't then the value of both character literals be the same since the value is derived from the translation character set element which doesn't retain any information about the original scalar value?
This does not seem intended to me. Does Unicode even provide complete information on which code points refer to the same abstract character? But then, what exactly is meant by abstract character in the standard draft?
This question is really about what "abstract character" means, and that term is defined by the Unicode Standard.
You cited a passage saying that an abstract character may correspond to multiple code points, or even to code point sequences.
The problem is that the rest of the Unicode Standard doesn't seem to agree.
If you look at the Unicode code charts (also part of the Unicode Standard), nothing in the entries for U+212B or U+00C5 says that they encode the same abstract character. The entry for U+212B says:
• preferred representation is 00C5 Å
≡ 00C5 Å latin capital letter a with ring above
However, the ≡ symbol is defined to mean, "canonical decomposition mapping". And if you head to the glossary to look that up, you'll find that this says nothing about what the abstract character is.
In fact, if you look around the glossary, you may stumble upon the definition of "character name":
Character Name. A unique string used to identify each abstract character encoded in the standard. (See definition D4 in Section 3.3, Semantics.)
So, every "abstract character" "encoded in the standard" has a "unique string" associated with it.
Therefore, if "U+212B" and "U+00C5" have different "character name" properties, they must be different abstract characters.
And if you look them up in the Unicode Character Database, they do in fact have different "character names". Ergo, they are different "abstract characters", which have different Unicode code points and therefore do not compare equal.
This contradicts the example given in the quoted part of the Unicode standard. So the problem is that the Unicode standard itself is inconsistent. The database that defines the mapping is inconsistent with part of the text.
It may well be that this is the only place in the Unicode Standard that claims multiple code points map to the same abstract character.
That being said, I would say that the C++ standard should use the term "encoded character" rather than "abstract character". The former clearly and unequivocally refers to a specific code point assigned to a character. Note that even the definition of "encoded character" does not recognize the possibility of multiple code points mapping to an abstract character: "between an abstract character and a code point." Those are both singular.
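To make the practical consequence concrete, here is a minimal sketch that spells the two characters with universal-character-names, which sidesteps any question about how the source file is decoded. The constant names are hypothetical. The value of a `char32_t` character literal is the code point value of the character it names, so the two literals differ.

```cpp
#include <cassert>

// Hypothetical names; the code points are the two from the question.
constexpr char32_t angstrom_sign = U'\u212B';  // ANGSTROM SIGN
constexpr char32_t a_with_ring   = U'\u00C5';  // LATIN CAPITAL LETTER A WITH RING ABOVE

static_assert(angstrom_sign == 0x212B, "value is the code point");
static_assert(a_with_ring == 0x00C5, "value is the code point");
static_assert(angstrom_sign != a_with_ring, "distinct code points");
```

Under the reading argued for above, the assertion in the question, written with the characters themselves in a UTF-8 source file, would behave the same way as a comparison of these two constants and therefore fail.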