We read in the C18 standard:
5.1.1.2 Translation phases
The precedence among the syntax rules of translation is specified by the following phases.
- Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
Meaning that the source file character set is decoded and mapped to the source character set.
But then you can read:
5.2.1 Character sets
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set).
Meaning that the source file character set is the source character set.
So the question is: which one did I understand wrong, or which one is actually wrong?
EDIT: Actually I was wrong. See my answer below.
Meaning that the source file character set is decoded and mapped to the source character set.
No, it does not mean that. My take is that the source is already assumed to be written in the source character set - how exactly would it make sense to "map the source character set to the source character set"? Either they are part of the set or they aren't. If you pick the wrong encoding for your source code, it will simply be rejected before the preprocessing even starts.
Translation phase 1 does two things not quite related to this at all:
Resolves trigraphs, which are standardized multibyte sequences.
Map multibyte characters into the source character set (defined in 5.2.1).
The source character set consists of the basic character set which is essentially the Latin alphabet plus various common symbols (5.2.1/3), and an extended character set, which is locale- and implemention-specific.
The definition of multibyte characters is found at 5.2.1.2:
The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set.
Meaning various locale-specific oddball special cases, such as locale-specific trigraphs.
All of this multibyte madness goes back to the first standardization in 1990 - according to anecdotes from those who were part of that committee, this was because members from various European countries weren't able to use various symbols on their national keyboards.
(I'm not sure how widespread the AltGr key on such keyboards was at the time. It remains a key subject to some serious button mashing when writing C on non-English keyboards anyway, to get access to {}[]
symbols etc.)