I'm coding in C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.
Trying to debug, I have something like the following:
std::wstring foo = L"ÆØÅ";
std::wcout << foo;
for(int i = 0; i < foo.length(); ++i) {
    std::wcout << std::hex << (int)foo[i] << " ";
    std::wcout << (char)foo[i];
}
Characteristics of output I get:
???
c6 d8 c5
When foo[i] is cast to char (or wchar_t), nothing is printed.
The environment variable $LANG is set to the default en_US.UTF-8.
In the conclusion of the answer I linked (which I still recommend reading) we can find:
When I should use std::wstring over std::string?
On Linux? Almost never, unless you use a toolkit/framework.
First of all, Linux natively uses UTF-8 and is consistent about it (in contrast to e.g. Windows, where files have one encoding and cmd.exe another).
Now let's have a look at this simple program:
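#include <cstddef>
#include <iostream>
#include <string>

int main()
{
    std::string foo = "ψA";    // character 'A' is just a control sample
    std::wstring bar = L"ψA";  // the same text as a wide string

    // Print every char (code unit) of the narrow string as an integer.
    for (std::size_t i = 0; i < foo.length(); ++i) {
        std::cout << static_cast<int>(foo[i]) << " ";
    }
    std::cout << std::endl;

    // Print every wchar_t of the wide string as an integer.
    // The values are plain ints, so std::cout is enough here; this also
    // avoids mixing std::cout and std::wcout on the same stdout.
    for (std::size_t i = 0; i < bar.length(); ++i) {
        std::cout << static_cast<int>(bar[i]) << " ";
    }
    std::cout << std::endl;
    return 0;
}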
The output is:
-49 -120 65
968 65
What does it tell us? 65 is the ASCII code of the character 'A', which means that -49 -120 and 968 correspond to 'ψ'.
In the case of char, the character 'ψ' actually takes two chars. In the case of wchar_t, it's just one wchar_t.
Let's also check sizes of those types:
std::cout << "sizeof(char) : " << sizeof(char) << std::endl;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;
Output:
sizeof(char) : 1
sizeof(wchar_t) : 4
1 byte on my machine has the standard 8 bits. char has 1 byte (8 bits), while wchar_t has 4 bytes (32 bits).
UTF-8 operates on, nomen omen, code units of 8 bits. There is also a fixed-length encoding, UTF-32, which uses exactly 32 bits (4 bytes) per Unicode code point, but it is UTF-8 that Linux uses.
Ergo, the terminal expects to get those two bytes (shown above as negative values) to print the character 'ψ', not a single value far outside the ASCII table (whose codes are defined only up to 127, half of the possible values of char).
That's why std::cout << char(-49) << char(-120); will also print ψ.
But it shows the const char[] as printing correctly. But when I typecast to (char), nothing is printed.
The character was already encoded differently; the stored values are different, so a simple cast is not enough to convert them.
And as I've shown, the size of char is 1 byte and that of wchar_t is 4 bytes. You can safely cast upward, not downward.