Consider the following code snippet:
struct vec2 {
    int x;
    int y;
};
constexpr vec2 Up{0,1};
constexpr vec2 Down{0,-1};
constexpr vec2 Left{-1,0};
constexpr vec2 Right{1,0};
The snippet above compiles without issues and is valid syntax.
Now consider the following version, which is rejected as invalid syntax:
struct vec2 {
    int x;
    int y;
};
constexpr vec2 ↑{0,1}; // Windows Alt Code: Alt+24
constexpr vec2 ↓{0,-1}; // Windows Alt Code: Alt+25
constexpr vec2 ←{-1,0}; // Windows Alt Code: Alt+27
constexpr vec2 →{1,0}; // Windows Alt Code: Alt+26
Compiler Explorer gives these compiler errors:
C3872
I understand that these are not valid identifiers. I'm just looking for clarity as to why the C++ language forbids them, what the standard has to say about them, and where that can be found in the standard. What is the reasoning behind preventing these from being valid identifiers?
The arrow characters have code point values U+2190 to U+2193.
The UCS/Unicode code points permitted in identifiers are listed in Table 2 of [lex.name] for C++17 and C++20 (linked here: the pre-C++17 draft N4659), in Annex E of the standard for all previous editions (starting with C++98), and by reference to Unicode Standard Annex #31 in [lex.name]/1 for C++23 (linked here: the current draft).
None of these lists the range U+2190 to U+2193 as permitted. Outside character and string literals, a compiler should therefore parse each of these characters as a single-character preprocessing token (the category for any non-whitespace character that matches nothing else), which should then be rejected as an ill-formed token.
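To make the "not in any listed range" point concrete, here is a minimal sketch that checks the arrow code points against an excerpt of the C++17 Table 2 ranges. Only the three entries closest to the arrows are included, and I transcribed them by hand, so treat them as an assumption to verify against the standard text. The names `CodePointRange`, `kExcerpt` and `in_excerpt` are made up for this illustration.

```cpp
// Hypothetical helper type: an inclusive range of code points.
struct CodePointRange {
    char32_t first;
    char32_t last;
};

// Excerpt only (NOT the full table): the entries of C++17 [lex.name] Table 2
// nearest to the arrows, as I read them. Verify against the standard.
constexpr CodePointRange kExcerpt[] = {
    {0x2070, 0x218F},
    {0x2460, 0x24FF},
    {0x2776, 0x2793},
};

// Returns true if cp falls inside one of the excerpted ranges.
constexpr bool in_excerpt(char32_t cp) {
    for (const CodePointRange& r : kExcerpt) {
        if (cp >= r.first && cp <= r.last) {
            return true;
        }
    }
    return false;
}

// The first excerpted range ends at U+218F, one code point before the arrows.
static_assert(in_excerpt(0x218F), "U+218F is inside the excerpted range");
static_assert(!in_excerpt(0x2190), "U+2190 LEFTWARDS ARROW is not listed");
static_assert(!in_excerpt(0x2191), "U+2191 UPWARDS ARROW is not listed");
static_assert(!in_excerpt(0x2192), "U+2192 RIGHTWARDS ARROW is not listed");
static_assert(!in_excerpt(0x2193), "U+2193 DOWNWARDS ARROW is not listed");
```

If the excerpt is right, the arrows block starts immediately after one permitted range ends, which is why all four characters fall through.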
The list of code points originates from ISO/IEC 10176 "Guidelines for the preparation of programming language standards" by JTC1/SC22/WG20. There is a document register for WG20 here.
On a quick look I couldn't find any accessible discussion of the range containing the arrow symbols specifically. From what I can tell, though, the intention was not to generally extend the traditional identifier syntax consisting of digits, Latin letters and _, but only to internationalize the "letters" part of this syntax so that identifiers can be written in native scripts. That is, the code point ranges allowed in addition to the traditional ones represent (for the most part) letters or other parts of scripts for different languages, not punctuation or symbols.
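As a small illustration of that distinction, the sketch below puts a non-ASCII letter in an identifier and keeps the arrow in a character literal, where it is allowed. The letter is written as a universal-character-name so the example does not depend on the source file encoding. The identifier `\u00e9t\u00e9` (the French word for "summer") and the name `UpGlyph` are my own examples, not anything from the standard.

```cpp
struct vec2 {
    int x;
    int y;
};

// U+00E9 (e with acute accent) is a letter in a permitted range, so it may
// appear in an identifier. This name spells the French word for "summer".
constexpr int \u00e9t\u00e9 = 42;

// The arrow cannot be part of an identifier, so the identifier stays ASCII
// and the arrow itself lives in a character literal instead.
constexpr vec2 Up{0, 1};
constexpr char32_t UpGlyph = U'\u2191';

static_assert(\u00e9t\u00e9 == 42, "identifier containing U+00E9 is usable");
static_assert(UpGlyph == 0x2191, "U+2191 is fine inside a literal");
```

Pairing an ASCII name such as `Up` with a glyph constant is one way to still get the arrow into output or labels without needing it in an identifier.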
I don't think there is a lot of support for allowing (sequences of) characters in identifiers that would visually be more likely to be read as punctuation, operators or symbols. In particular, the change to UAX #31 in C++23 results in emojis becoming disallowed in identifiers. According to the relevant proposal, P1949, emojis were only allowed because the code points in the originally specified identifier ranges had not yet been assigned when those ranges were specified.