What is the encoding of unprefixed string literals in C++? For example, all string literals are parsed and stored as UTF-16 in Java and as UTF-8 in Python 3. I guess the same holds for C++ u8"" literals, but I'm not clear about ordinary literals like "".
What should be the output of the following code?
#include <iostream>
#include <iomanip>

int main() {
    auto c = "Hello, World!";
    // Print each byte of the literal in hex.
    while (*c) {
        std::cout << std::hex << static_cast<unsigned int>(*c++) << " ";
    }
}
When I run this on my machine, it gives the following output:
48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21
But is this guaranteed? The Cppreference page for string literals says that the characters inside ordinary string literals come from the translation character set, and the page on the translation character set states:
The translation character set consists of the following elements:
- each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
- a distinct character for each UCS scalar value where no named character is assigned.
From this definition, it seems the translation character set refers to Unicode (or a superset of it). Is there then no difference between "" and u8"" except for explicitness?
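For instance, here is a sketch of how one could observe the difference (the non-ASCII character is my choice for illustration; the char8_t type requires C++20):

#include <iostream>

int main() {
    // "\u00E9" ('é'): the bytes stored depend on the implementation-defined
    // ordinary literal encoding; u8"\u00E9" is guaranteed to be UTF-8 (c3 a9).
    const char*    ordinary = "\u00E9";
    const char8_t* utf8     = u8"\u00E9";   // char8_t since C++20

    for (const char* p = ordinary; *p; ++p)
        std::cout << std::hex
                  << static_cast<unsigned int>(static_cast<unsigned char>(*p)) << ' ';
    std::cout << '\n';

    for (const char8_t* p = utf8; *p; ++p)
        std::cout << std::hex << static_cast<unsigned int>(*p) << ' ';
    std::cout << '\n';
}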
Suppose I want my string to be stored in an EBCDIC encoding (just as an exercise); what is the correct way to achieve that in C++?
EDIT: The linked Cppreference page for string literals does say that the encoding is implementation-defined. Does that mean I should avoid using them?
The encoding of string literals is controlled by compiler settings, and the defaults depend on the compiler. AFAIK, MSVC by default uses the encoding defined by the system locale, while gcc/clang assume UTF-8.
In MSVC you can change this with the /execution-charset: switch. Gcc and clang have the -fexec-charset= switch.
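For the EBCDIC exercise above, something like this should work with a gcc/clang built against iconv (the code page name IBM1047 is an assumption; the set of accepted names depends on your iconv installation):

// Compile with (assuming iconv support and the IBM1047 code page name):
//   g++ -fexec-charset=IBM1047 ebcdic.cpp -o ebcdic
#include <iostream>

int main() {
    // Under an EBCDIC execution charset, 'A' is stored as 0xC1 instead of
    // the ASCII/UTF-8 value 0x41, so this should print "c1".
    // The cast through unsigned char avoids sign extension for bytes > 0x7F.
    std::cout << std::hex
              << static_cast<unsigned int>(static_cast<unsigned char>('A'));
    // Note: every narrow literal is re-encoded, including '\n', which is why
    // no newline literal is used here.
}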
Note that you have to instruct the standard library what the current encoding of your string literals is. This is one of the features of std::locale::global.
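A minimal sketch of that step (assuming the literals were compiled in the encoding that the user's environment reports):

#include <iostream>
#include <locale>

int main() {
    // Make the user's preferred locale the global default, so locale-aware
    // facilities (conversions, formatting, etc.) match the environment.
    std::locale::global(std::locale(""));
    // Streams keep the locale they were constructed with, so imbue explicitly.
    std::cout.imbue(std::locale());
    std::cout << "Hello, World!\n";
}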
Here is my other answer where I did some experiments with MSVC.