Search code examples
c++stringcharacter-encodingcharnon-ascii-characters

How does string work with non-ascii symbols while char does not?


I understand that char in C++ is just an integer type that stores ASCII symbols as numbers ranging from 0 to 127. The Scandinavian letters 'æ', 'ø', and 'å' are not among the 128 symbols in the ASCII table.

So naturally when I try char ch1 = 'ø' I get a compiler error, however string str = "øæå" works fine, even though a string makes use of chars right?

Does string somehow switch over to Unicode?


Solution

  • From the source code char c = 'ø';:

    source_file.cpp:2:12: error: character too large for enclosing character literal type
      char c = '<U+00F8>';
               ^
    

    What's happening here is that the compiler is converting the character from the source code encoding and determining that there's no representation of that character using the execution encoding that fits inside a single char. (Note that this error has nothing to do with the initialization of c, it would happen with any such character literal. examples)

    When you put such characters into a string literal rather than a character literal, however, the compiler's conversion from the source encoding to the execution encoding is perfectly happy to use multi-byte representations of the characters when the execution encoding is multi-byte, such as UTF-8 is.

    To better understand what compilers do in this area you should start by reading clauses 2.3 [lex.charsets], 2.14.3 [lex.ccon], and 2.14.5 [lex.string] in the C++ standard.