I've been struggling with Unicode support in C++ and am getting strange behaviour. If I load a line of Unicode text from a file, I can store it in a regular string and output it to stdout without problems. But if I use a wide string literal with the exact same characters, I have to store it in a wstring, and it fails to output properly. Why?
Why is it possible to hold Unicode in a string at all, instead of needing a wstring? And why does the string print correctly with cout while the wstring fails with wcout?
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char* argv[])
{
    ifstream infile("unicode.txt");
    string strFromFile;
    getline(infile, strFromFile);
    infile.close();
    cout << "From file: " << strFromFile << endl;

    wstring strLiteral = L"ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ";
    wcout << "From literal: " << strLiteral << endl;
    return 0;
}
And the output:
From file: ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ
From literal: �
�"&*46A�JL��V^fj�t�N��
�!'157BqKM��W_gk�u�O�
I compiled this example with:
g++ main.cpp -Wall
And my compiler is:
g++ (Ubuntu 13.2.0-4ubuntu3) 13.2.0
Encoding. Most likely, your file is in UTF-8, which encodes Unicode in byte sequences of varying length. Print the length of the string that you read, and compare with the number of characters displayed.
Or do a hexdump -C unicode.txt, and observe that the characters on the right are now... garbage.
L"" strings are more complicated in comparison, though. Observe these tests on my system:
#include <iostream>
#include <string>
int main()
{
    std::wcout << L"ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ\n";
}
stieber@gatekeeper:~ $ g++ Test.cpp; ./a.out
????????????M??????????CH???????????????????????n?ch??
Different from yours, but not what we want. Let's do something else:
#include <iostream>
#include <string>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, "");
    std::wcout << L"ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ\n";
}
stieber@gatekeeper:~ $ g++ Test.cpp; ./a.out
ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ
Much better. This is because we are setting the locale to the user's preferred locale instead of the default "C" locale, which just seems to hate Unicode. Unfortunately, it will also affect things like numeric output, since locales are really meant for localization, and abusing them for character encodings is something that would need strong language to describe.
Using LC_CTYPE seems to work as well, and will probably affect fewer other things -- but it's still not defined as the "output character set", so it's still not 100% safe.
However, we can always do this:
#include <iostream>
#include <string>
int main()
{
    std::cout << "ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ\n";
}
stieber@gatekeeper:~ $ g++ Test.cpp; ./a.out
ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ
That works because we bypass the locale machinery entirely: the narrow literal is just a sequence of UTF-8 bytes, and cout passes those bytes through to the terminal untouched.
HOWEVER, observe:
#include <iostream>
#include <string>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, "");
    std::cout << "ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ\n";
    std::wcout << L"ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ\n";
}
This does not work:
stieber@gatekeeper:~ $ g++ Test.cpp; ./a.out
ĀƁĊĐĒƑĢĦĪĴĶŁΜŊŌƤǬŖŞŦŪƲŴΧɎƵāƂċđēƒġħıĵķłɱŋōƥǭŗşŧūνŵχɏƶ
▒
▒"&*46A▒JL▒▒V^fj▒t▒N▒▒
▒!'157BqKM▒▒W_gk▒u▒O▒
We get the correct string first, followed by garbage. Do NOT mix std::cout and std::wcout -- I'm not entirely sure why that happens, or whether whatever is responsible can be reset, but for now it seems like a bad idea.