Tags: c++, c, c++11, language-lawyer, c11

C11 & C++11 Extended and Universal Character Escaping


Context

C11 and C++11 both support extended characters in source files, as well as universal character names (UCNs), which allow one to enter characters not in the basic source character set using only characters that are.

C++11 also defines several translation phases of compilation. In particular, extended characters are replaced by UCNs in the very first phase of translation, quoted below:

§ C++11 2.2p1.1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)


Question

Does a Standard-conforming compilation of the program

#include <stdio.h>

int main(void){
        printf("\é\n");
        printf("\\u00e9\n");
        return 0;
}

fail, compile and print

é
é

or compile and print

\u00e9
\u00e9

, when run?


Informed Personal Opinion

It is my contention that it successfully compiles and prints \u00e9 twice, since by §2.2p1.1 above, we have

"An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal," and we are not in a raw string literal.

It then follows that

  • In Phase 1, printf("\é\n"); maps to printf("\\u00e9\n");.
  • In Phase 3, the source file is decomposed into preprocessing tokens (§2.2p1.3), of which the string-literal "\\u00e9\n" is one.
  • In Phase 5, each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set (§2.2p1.5). Thus, by the maximal munch principle, \\ maps to \, and the fragment u00e9 is not recognized as a UCN and therefore prints as-is.

Experiments

Unfortunately, extant compilers disagree with me. I've tested with both GCC 4.8.2 and Clang 3.5, and here is what they gave me:

  • GCC 4.8.2

    g++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
    

    Output:

    ucn.cpp: In function 'int main()':
    ucn.cpp:4:9: warning: unknown escape sequence: '\303' [enabled by default]
      printf("\é\n");
             ^
    
    ./ucn
    

    Output:

    é
    \u00e9
    
  • Clang 3.5

    clang++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
    

    Output:

    ucn.cpp:4:10: warning: unknown escape sequence '\xFFFFFFC3' [-Wunknown-escape-sequence]
            printf("\é\n");
                    ^
    ucn.cpp:4:12: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
            printf("\é\n");
                     ^
    2 warnings generated.
    
    ./ucn
    

    Output:

    é
    \u00e9
    

I have double- and triple-checked that the é character appears as C3 A9 using hexdump -C ucn.cpp, in agreement with the expected UTF-8 encoding. I've moreover verified that a plain printf("é\n"); or printf("\u00e9\n"); works flawlessly, so this is not a problem of the compilers tested being unable to read UTF-8 source files.

Who's right?


Solution

  • 'é' is not a valid character to backslash-escape in a string literal, so a backslash followed by 'é' (whether as a literal source character or as a UCN) results in undefined behavior and should draw a compiler diagnostic.

    Note, however, that "\\u00e9" is not a UCN preceded by a backslash, and that it's not possible to write any sequence of basic source characters in a string or character literal that is a backslash followed by a UCN. Thus "\é" and "\\u00e9" are not required to behave the same: the behavior of "\\u00e9" can be perfectly well defined while the behavior of "\é" is undefined.

    If we were to posit some syntax that allowed backslash escaping a UCN, say "\«\u00e9»", then that would have undefined behavior like "\é".


    • In Phase 1, printf("\é\n"); maps to printf("\\u00e9\n");.

    The phase one conversion of é into a UCN cannot create a non-UCN, such as "\\u00e9".


    The compilers are right, though neither handles this situation with a perfect diagnostic message. Ideally, what you'd get is:

    $ clang++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
    ucn.cpp:4:10: warning: unknown escape sequence '\é' [-Wunknown-escape-sequence]
            printf("\é\n");
                    ^
    1 warning generated.
    $ ./ucn
    é
    \u00e9
    

    Both compilers specify that their behavior in the presence of an unknown escape sequence is to replace the escape sequence with the character thus escaped, so "\é" would be treated as "é" and the program overall should be interpreted as:

    #include <stdio.h>
    
    int main(void){
            printf("é\n");
            printf("\\u00e9\n");
            return 0;
    }
    

    Both compilers happen to get this behavior right, partly by chance, but also partly because their policy for unrecognized escape sequences is a smart choice: even though they see the unrecognized escape sequence only as the backslash followed by the byte 0xC3, they remove the backslash and leave the 0xC3 in place, which means the UTF-8 sequence is left intact for later processing.