2nd Update: I found a very simple solution to this actually not-that-hard problem only one day after asking. But people seem to be small-minded, so there are three close votes already:
Duplicate of "How to use unicode characters in Windows command line?" (1x):
Obviously not, as has been clarified in the comments. This is not about the Windows command line, which I do not use.
Unclear what you're asking (1x):
Then you must suffer from functional illiteracy. I cannot be any more concrete than when I ask, for example, "Is there an easy way to determine whether a char in a std::string is a non-ending part of a UTF-8 symbol?" (marked in bold for better visibility, no less) and state that this would be sufficient to answer the question (and even explain why). Seriously, there are even pictures to show the problem. Furthermore, my own existing answer should clarify it even more. Your own deficiencies are not sufficient reason to declare something too hard to understand.
Too broad (1x) ("Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer [...]"):
This must be another case of functional illiteracy. I stated clearly that a single way to solve the problem (which I have already found) is sufficient. You can identify an adequate answer as follows: take a look at my own accepted answer. Alternatively, use your brain to interpret my well-defined words, if you are able to, which several people on this platform unfortunately seem not to be.
There is, however, an actual reason to close this question: it has already been solved. But there is no such close-vote option, so clearly Stack Exchange accepts that alternative solutions may still be found. Since I am a curious person, I am also interested in alternative ways to solve this. If your lack of intelligence keeps you from understanding what the problem is and that it is quite relevant in certain environments (e.g. ones that use Windows, C++ in Eclipse CDT, and UTF-8, but neither Visual Studio nor the Windows console), then you can just leave without standing in the way of other people satisfying their curiosity. Thanks!
1st Update: I used
app.exe > out.txt 2>&1
which generates a file without these formatting issues. So the problem seems to be that std::cout splits the output, and the underlying control (which receives the char sequence) is expected to reassemble it correctly? (Unfortunately nothing on Windows seems to handle that, except file streams. So I still need to circumvent this, preferably without writing to a file first and displaying its content, which of course works.)
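As a side note, here is a minimal sketch of that file-based check (the file name and helper name are purely illustrative):
#include <fstream>
#include <string>

// Writes the raw bytes of s to a file so they can be inspected in a
// UTF-8 capable editor; this confirms the string itself is well-formed.
void dumpUtf8(const std::string& s) {
    std::ofstream out("out.txt", std::ios::binary);
    out << s; // no translation, the bytes arrive unchanged
}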
On the system that I use (Windows 7; MinGW-w64 (GCC 8.1 for Windows)), there is a bug with std::cout such that UTF-8 encoded strings are printed out before they are reassembled, even if they were only disassembled internally by std::cout when passing a large string. The following code shows how the bug seems to behave. Note, however, that the faulty output appears to be random, i.e. the way std::cout slices up (equal) std::string objects is not the same for every execution of the program. But the problems appear consistently at indices that are multiples of 1024, which is how I deduced this behavior.
#include <iostream>
#include <sstream>
void myFaultyOutput();
void simulatedFaultyBehavior();
int main()
{
    myFaultyOutput();
    //simulatedFaultyBehavior();
}

void myFaultyOutput() {
    std::stringstream ss; // Note that ss is built correctly (which could be shown by saving ss.str() to a file).
    ss << "...";
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 341; j++)
            ss << u8"\u301A";
        ss << "\n..";
    }
    std::cout << ss.str() << std::endl; // Problem occurs here, with cout.
    // Note that converting ss.str() to UTF-16 std::wstring and using std::wcout results in std::wcout not
    // displaying anything, not even ASCII characters afterwards (until restarting the application).
}

// To display the problem on well-behaved systems; just imagine the output did not contain the newlines,
// while the faultily formatted characters remain.
void simulatedFaultyBehavior() {
    std::stringstream ss;
    int amount = 2000;
    for (int j = 0; j < amount; j++)
        ss << u8"\u301A";
    std::string s = ss.str();
    std::cout << "s.length(): " << s.length() << std::endl; // amount * 3
    while (s.length() > 1024) {
        std::cout << s.substr(0, 1024) << std::endl;
        s = s.substr(1024);
    }
    std::cout << s << std::endl;
}
To circumvent this behavior, I would like to split up large strings (which I receive as such from an API) manually into parts of fewer than 1024 chars (and then call std::cout separately on each of them). But I don't know which chars actually are just a non-ending part of a UTF-8 symbol, and the built-in Unicode converters also seem to be unreliable (possibly also system-dependent?). Is there an easy way to determine whether a char in a std::string is a non-ending part of a UTF-8 symbol? The following quote explains why answering this question would be sufficient.
A UTF-8 character can, for example, consist of three chars. So if one splits a string into two parts, it should keep those three chars together. Otherwise, one has to do what the existing GUI controls are clearly not able to do consistently, namely reassembling UTF-8 characters that have been split into pieces.
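To make the intended splitting concrete, here is a rough sketch under the assumption that UTF-8 continuation bytes (bit pattern 10xxxxxx) are exactly the "non-ending parts" in question; the helper names and the 1024 threshold are only illustrative:
#include <algorithm>
#include <iostream>
#include <string>

// True if c is a UTF-8 continuation byte (10xxxxxx), i.e. the chunk
// boundary must not be placed right before it.
bool isUtf8Continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Prints s in chunks of at most 1024 chars, moving each chunk boundary
// backwards so that no multi-byte UTF-8 character is split.
void printInChunks(const std::string& s) {
    std::size_t pos = 0;
    while (pos < s.length()) {
        std::size_t len = std::min<std::size_t>(1024, s.length() - pos);
        while (len > 1 && pos + len < s.length() && isUtf8Continuation(s[pos + len]))
            --len; // do not start the next chunk in the middle of a character
        std::cout << s.substr(pos, len) << std::flush;
        pos += len;
    }
}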
Better ideas to circumvent the problem (other than "Don't use Windows" / "Don't use UTF-8" / "Don't use cout", of course) are also welcome.
Note that this question is unrelated to the Windows console (I do not use it; things are displayed in Eclipse and optionally in wxWidgets UI elements, which display UTF-8 correctly). It is also unrelated to MSVC (I use the MinGW compiler, as mentioned). The code also notes that using std::wcout with UTF-16 does not work at all (due to another MinGW and Eclipse bug). The bug results from UI controls being unable to handle what std::cout does (which may be intentional or not). Furthermore, everything usually works fine, except for those UTF-8 symbols that were split up into different chars (e.g. \u301A into \u0003 + \u001A) at indices that are multiples of 1024 (and only randomly). This behavior alone already implies that most of the commenters' assumptions are false. Please consider the code -- especially its comments -- carefully rather than rushing to conclusions.
To clarify the display issue when calling myFaultyOutput(), refer to the pictures mentioned above.
By experimenting, I worked out a fairly simple workaround, which I am surprised nobody seemed to know about (I found nothing like it online).
N.m.'s attempted answer gave a good hint by mentioning the platform-specific function _setmode. What it does "by design" (according to this answer and this article) is to set the file translation mode, i.e. how the input and output streams of the process are handled. But at the same time, it invalidates using std::ostream / std::istream and instead dictates using std::wostream / std::wistream for decently formatted input and output streams.
For instance, using _setmode(_fileno(stdout), _O_U8TEXT) means that std::wcout now works well for outputting a std::wstring as UTF-8, but std::cout prints out garbage characters, even for ASCII arguments. However, I want to be able to mainly use std::string, and especially std::cout, for output. As I have mentioned, it is a rare case that the formatting for std::cout fails, so I only want to use a special output function, say coutUtf8String(string s), in the cases where I print strings that may trigger this issue (potential multi-char-encoded characters at indices of at least 1024).
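To illustrate that effect in isolation, here is a minimal sketch based on the behavior described above (it assumes _O_U8TEXT is defined, see the guard in the full code below; on MSVC the narrow cout call may even assert instead of printing garbage):
#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <iostream>

int main() {
    _setmode(_fileno(stdout), _O_U8TEXT);        // wide streams now emit UTF-8
    std::wcout << L"\u301A via wcout\n";         // displayed correctly
    std::cout << "plain ASCII via cout\n";       // garbage (or an assertion) in this mode
    _setmode(_fileno(stdout), _O_BINARY);        // back to the untranslated mode
    std::cout << "plain ASCII via cout again\n"; // fine again
}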
The default (untranslated) mode of _setmode is _O_BINARY. We can temporarily switch modes. So why not just switch to _O_U8TEXT, convert the UTF-8 encoded std::string object to a std::wstring, use std::wcout on it, and then switch back to _O_BINARY? To stay platform-independent, one can simply define the usual std::cout call when not on Windows. Here is the code:
#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
#include <fcntl.h> // Also includes the non-standard file <io.h>
                   // (POSIX compatibility layer) to use _setmode on Windows NT.
#include <cstdio>  // For _fileno.
#ifndef _O_U8TEXT  // Some GCC distributions such as TDM-GCC 9.2.0 require this explicit
                   // definition since, depending on __MSVCRT_VERSION__, they might
                   // not define it.
#define _O_U8TEXT 0x40000
#endif
#endif

#include <cstdint>
#include <iostream>
#include <string>
using namespace std;

wstring utf8toWide(const char* in); // Defined below.

void coutUtf8String(string s) {
#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
    if (s.length() > 1024) {
        // Set translation mode of wcout to UTF-8, which renders cout unusable "by design"
        // (see https://developercommunity.visualstudio.com/t/_setmode_filenostdout-_O_U8TEXT;--/394790#T-N411680).
        if (_setmode(_fileno(stdout), _O_U8TEXT) != -1) {
            wcout << utf8toWide(s.c_str()) << flush; // We must flush before resetting the mode.
            // Set translation mode of wcout back to untranslated, which renders cout usable again.
            _setmode(_fileno(stdout), _O_BINARY);
        } else
            // Let's use wcout anyway: no sink (such as Eclipse's console
            // window) is attached when _setmode fails, and such sinks seem to be
            // the cause of wcout failing in default mode. The UI console view
            // is filled properly like this, regardless of translation modes.
            wcout << utf8toWide(s.c_str()) << flush;
    } else
        cout << s << flush;
#else
    cout << s << flush;
#endif
}
wstring utf8toWide(const char* in) {
    wstring out;
    if (in == nullptr)
        return out;
    uint32_t codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)      // Single-byte (ASCII) character.
            codepoint = ch;
        else if (ch <= 0xbf) // Continuation byte: append its low 6 bits.
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf) // Lead byte of a 2-byte sequence.
            codepoint = ch & 0x1f;
        else if (ch <= 0xef) // Lead byte of a 3-byte sequence.
            codepoint = ch & 0x0f;
        else                 // Lead byte of a 4-byte sequence.
            codepoint = ch & 0x07;
        ++in;
        // If the next byte is not a continuation byte, the code point is complete.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            if (codepoint > 0xffff) { // Encode as a UTF-16 surrogate pair.
                out.append(1, static_cast<wchar_t>(0xd7c0 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            } else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}
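For completeness, a minimal usage sketch (assuming the two functions above are in the same translation unit and a pre-C++20 u8 literal, as in the code from the question; the test string mirrors the one from myFaultyOutput):
#include <sstream>

int main() {
    std::stringstream ss;
    for (int i = 0; i < 2000; i++)
        ss << u8"\u301A";     // 2000 * 3 chars, well above the 1024 threshold
    coutUtf8String(ss.str()); // Takes the wcout/_O_U8TEXT path on Windows.
}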
This solution is especially convenient since it does not effectively abandon UTF-8, std::string, or std::cout, which are mainly used for good reasons, but simply keeps using std::string itself and preserves platform independence. I rather agree with this answer that adding wchar_t (and all the redundant rubbish that comes with it, such as std::wstring, std::wstringstream, std::wostream, std::wistream, std::wstreambuf) to C++ was a mistake. Just because Microsoft makes bad design decisions, one should not adopt their mistakes but rather circumvent them.