Tags: windows, c++11, utf-8, cmd, visual-c++-2015

C++11 std::cout << "string literal in UTF-8" to Windows cmd console? (Visual Studio 2015)


Summary: How can I correctly print a string literal, defined in source code that is saved as UTF-8 (Windows CP 65001), to a cmd console using the std::cout stream?

Motivation: As an experiment, I would like to modify the excellent Catch unit-testing framework so that it displays my texts with accented characters. The modification should be simple and reliable, and it should also be useful for other languages and working environments, so that the author could accept it as an enhancement. Alternatively, if you know Catch and there is another solution, could you post it?

Details: Let's start with the Czech version of "the quick brown fox...":

#include <iostream>
#include <windows.h>   // SetConsoleOutputCP, CP_UTF8

using namespace std;

int main()
{
    // The same literal is printed three times; only the console code page changes.
    cout << "\n-------------------------- default cmd encoding = 852 -------------------\n";
    cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;

    cout << "\n-------- Windows Central European (1250) set for the cmd console --------\n";
    SetConsoleOutputCP(1250);
    cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;

    cout << "\n------------- Windows UTF-8 (65001) set for the cmd console -------------\n";
    SetConsoleOutputCP(CP_UTF8);
    cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;
}

It prints the following (font set to Lucida Console): [screenshot of the console output]

The default cmd code page is 852, the default Windows (ANSI) code page is 1250, and the source file was saved as UTF-8 with BOM (code page 65001). Calling SetConsoleOutputCP(1250) changes the console output code page programmatically, much as chcp 1250 does from the command line.

Observation: When setting the 1250 encoding, the UTF-8 string literal is printed correctly. I believe it can be explained, but it is really strange. Is there any decent, human, general way to solve the problem?

Update: In my case the narrow string literal is stored using the Windows-1250 encoding (the native Windows encoding for Central Europe). This seems to be independent of the encoding of the source file: the compiler converts the literal to the native Windows encoding. That is why switching cmd to that encoding gives the desired output. It is ugly, but how can I get the native Windows encoding programmatically, so that I can pass it to SetConsoleOutputCP(cpX)? What I need is a constant that is valid for the machine where the compilation happened, not the native encoding of the machine where the executable runs.

C++11 also introduced u8"the UTF-8 string literal", but it does not seem to work together with SetConsoleOutputCP(CP_UTF8);


Solution

  • This is a partial answer, found by following the link posted by luk32 and confirming the comments by Melebius (see below the question). It is not a complete answer, and I will be happy to accept your follow-up comment.

    I have just found the UTF-8 Everywhere Manifesto, which touches on the problem. Its point 17, "Q: How do I write UTF-8 string literal in my C++ code?", says (explicitly also for the Microsoft C++ compiler):

    However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:

        "∃y ∀x ¬(x ≺ y)"

    Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct codepage and will not touch your strings. However, it renders it impossible to use Unicode identifiers and wide string literals (that you will not be using anyway).

    I really like the manifesto. In short, put bluntly and possibly oversimplified, it says:

    Ignore wstring, wchar_t, and the like. Ignore the code pages. Ignore the string literal prefixes such as L, u, U, and u8. Use UTF-8 everywhere. Write all literals "naturally". Ensure that they are also stored as UTF-8 in the compiled binary.

    If the following code is stored with UTF-8 without BOM...

    #include <iomanip>
    #include <iostream>
    #include <windows.h>   // SetConsoleOutputCP, CP_UTF8
    
    using namespace std;
    
    int main()
    {
        SetConsoleOutputCP(CP_UTF8);
        cout << "Příšerně žluťoučký kůň úpěl ďábelské ódy!" << endl;
    
        // Hex-dump the bytes the compiler actually stored for the literal
        // (the terminating '\0' is included), 16 bytes per row.
        int cnt = 0;
        for (unsigned int c : "Příšerně žluťoučký kůň úpěl ďábelské ódy!")
        {
            cout << hex << setw(2) << setfill('0') << (c & 0xff);
            ++cnt;
            if (cnt % 16 == 0)      cout << endl;
            else if (cnt % 8 == 0)  cout << " | ";
            else if (cnt % 4 == 0)  cout << "  ";
            else                    cout << ' ';
        }
        cout << endl;
    }
    

    It prints (should be UTF-8 encoded)...

    [screenshot of the console output]

    When saving the source as UTF-8 with BOM, it prints a different result...

    [screenshot of the console output]

    However, the problem remains -- how to set the console encoding programmatically so that the UTF-8 string is printed correctly.

    I gave up. The cmd console is simply crippled, and it is not worth trying to fix it from the outside. I am accepting my own answer only to close the question. If anyone finds a decent solution related to the Catch unit-test framework (it could be a completely different approach), I will be glad to accept their comment as the answer.