c++ character-encoding wchar-t

Printing Latin characters in Linux terminal using `std::wstring` and `std::wcout`


I'm coding in C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.

Trying to debug, I have something like the following:

std::wstring foo = L"ÆØÅ";
std::wcout << foo;
for(int i = 0; i < foo.length(); ++i) {
    std::wcout << std::hex << (int)foo[i] << " ";
    std::wcout << (char)foo[i];
}

Characteristics of output I get:

  • The first print shows: ???
  • The loop prints the hex for the three characters as c6 d8 c5
  • When foo[i] is cast to char (or wchar_t), nothing is printed

The environment variable $LANG is set to the default en_US.UTF-8.


Solution

  • The conclusion of the answer I linked (which I still recommend reading) says:

    When I should use std::wstring over std::string?

    On Linux? Almost never, unless you use a toolkit/framework.

    Short explanation why:

    First of all, Linux natively uses UTF-8 and is consistent about it (in contrast to e.g. Windows, where files use one encoding and cmd.exe another).

    Now let's have a look at this simple program:

    #include <iostream>
    
    int main()
    {
    std::string  foo =  "ψA"; // character 'A' is just a control sample
    std::wstring bar = L"ψA"; // --
    
        for (int i = 0; i < foo.length(); ++i) {
            std::cout  << static_cast<int>(foo[i]) << " ";
        }
        std::cout << std::endl;
    
        for (int i = 0; i < bar.length(); ++i) {
            std::wcout << static_cast<int>(bar[i]) << " ";
        }
        std::wcout << std::endl; // keep the wide stream; mixing cout/wcout on stdout is discouraged
    
        return 0;
    }
    

    The output is:

    -49 -120 65 
    968 65 
    

    What does it tell us? 65 is the ASCII code of the character 'A', which means that -49 -120 (in the char case) and 968 (in the wchar_t case) correspond to 'ψ'.

    In the char case, the character 'ψ' actually takes two chars. In the wchar_t case, it's just one wchar_t.
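    The arithmetic behind those two bytes can be checked directly: 968 is U+03C8, and applying the two-byte UTF-8 pattern 110xxxxx 10xxxxxx to it yields exactly 0xCF 0x88, i.e. -49 -120 as signed chars. A minimal sketch (the helper name `utf8_two_byte` is my own):

    ```cpp
    #include <cstdint>
    #include <iostream>
    #include <string>

    // Encode a code point in the range U+0080..U+07FF into the two-byte
    // UTF-8 form 110xxxxx 10xxxxxx.
    std::string utf8_two_byte(std::uint32_t cp) {
        std::string out;
        out += static_cast<char>(0xC0 | (cp >> 6));   // leading byte
        out += static_cast<char>(0x80 | (cp & 0x3F)); // continuation byte
        return out;
    }

    int main() {
        std::string bytes = utf8_two_byte(0x3C8); // U+03C8 = 968 = 'ψ'
        for (char c : bytes) {
            // Viewed as signed values: exactly the numbers dumped above.
            std::cout << static_cast<int>(static_cast<signed char>(c)) << " ";
        }
        std::cout << "\n"; // prints: -49 -120
        return 0;
    }
    ```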

    Let's also check sizes of those types:

    std::cout << "sizeof(char)    : " << sizeof(char)    << std::endl;
    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;
    

    Output:

    sizeof(char)    : 1
    sizeof(wchar_t) : 4
    

    A byte on my machine is the standard 8 bits: char is 1 byte (8 bits), while wchar_t is 4 bytes (32 bits).

    UTF-8 operates on, nomen omen, 8-bit code units. There is also a fixed-length UTF-32 encoding that uses exactly 32 bits (4 bytes) per Unicode code point, but it's UTF-8 that Linux uses.

    Ergo, the terminal expects to receive those two (negatively signed) values to print the character 'ψ', not one value far above the ASCII range (ASCII codes are defined only up to 127, half of char's possible values).

    That's why std::cout << char(-49) << char(-120); will also print ψ.


    But the const char[] prints correctly. When I cast to (char), nothing is printed.

    The characters are already encoded differently: the two strings store different values, so a simple cast is not enough to convert between the encodings.
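    One standard way to do a real conversion (rather than a cast) is std::wcstombs, which re-encodes a wide string into the multibyte encoding of the current C locale. A minimal sketch, assuming a UTF-8 locale such as C.UTF-8 or en_US.UTF-8 is installed; the helper name `to_multibyte` is illustrative:

    ```cpp
    #include <clocale>
    #include <cstdlib>
    #include <iostream>
    #include <string>

    // Convert a wide string to the multibyte encoding of the current
    // C locale (UTF-8 here) instead of casting each wchar_t to char.
    std::string to_multibyte(const std::wstring& ws) {
        // First call with a null destination just measures the result.
        std::size_t needed = std::wcstombs(nullptr, ws.c_str(), 0);
        if (needed == static_cast<std::size_t>(-1)) {
            return {}; // not representable in the current locale
        }
        std::string out(needed, '\0');
        std::wcstombs(&out[0], ws.c_str(), needed);
        return out;
    }

    int main() {
        // Assumption: a UTF-8 locale is available on this system.
        if (!std::setlocale(LC_ALL, "C.UTF-8")) {
            std::setlocale(LC_ALL, "en_US.UTF-8");
        }
        std::string utf8 = to_multibyte(L"\u03C8"); // 'ψ'
        for (unsigned char c : utf8) {
            std::cout << std::hex << static_cast<int>(c) << " "; // cf 88 under a UTF-8 locale
        }
        std::cout << "\n";
        return 0;
    }
    ```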

    And as I've shown, char is 1 byte and wchar_t is 4 bytes, so you can safely cast upward (widening), but not downward (narrowing).
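    Coming back to the original `???` output: std::wcout starts out with the classic "C" locale and replaces wide characters it cannot convert. A minimal sketch of the usual fix, picking the locale up from the environment ($LANG); the helper name `use_environment_locale` is my own:

    ```cpp
    #include <iostream>
    #include <locale>
    #include <string>

    // Switch the global locale (and std::wcout) to the one named by the
    // environment ($LANG, e.g. en_US.UTF-8) so wide characters are
    // encoded as UTF-8 on output instead of being replaced with '?'.
    // Returns false if the environment's locale is not installed.
    bool use_environment_locale() {
        try {
            std::locale::global(std::locale(""));
            std::wcout.imbue(std::locale());
            return true;
        } catch (const std::runtime_error&) {
            return false;
        }
    }

    int main() {
        use_environment_locale();
        std::wstring foo = L"ÆØÅ";
        std::wcout << foo << L'\n'; // prints ÆØÅ under a UTF-8 locale
        return 0;
    }
    ```

    Alternatively, on Linux the simplest route is often to skip std::wstring entirely and keep the text as UTF-8 in a plain std::string printed via std::cout, as the quoted answer recommends.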