c++ utf-8 g++ locale clang++

wcin.imbue and UTF-8


On Linux with g++, if I set a UTF-8 global locale, wcin correctly transcodes UTF-8 input to the internal wchar_t encoding.

However, if I keep the classic locale as the global locale and imbue a UTF-8 locale into wcin, this doesn't happen. Input either fails altogether, or each individual byte is converted to wchar_t independently.

With clang++ and libc++, neither setting the global locale nor imbuing the locale into wcin works.

#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main() {
    if (true)
        // this works with g++, but not with clang++/libc++
        locale::global(locale("C.UTF-8"));
    else
        // this doesn't work with either implementation
        wcin.imbue(locale("C.UTF-8"));
    wstring s;
    wcin >> s;
    cout << s.length() << " " << (s == L"áéú");
    return 0;
}

The input stream contains only the characters áéú, encoded in UTF-8 (not in any single-byte encoding).
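
To make the two outcomes concrete: in UTF-8, áéú occupies six bytes (two per character), so a correct decode should give a wstring of length 3, while converting each byte independently would give length 6. The sketch below is not part of the original program; count_code_points is a hypothetical helper that assumes well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): counts UTF-8 code points
// by counting every byte that is NOT a continuation byte (10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& bytes) {
    std::size_t n = 0;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return n;
}

// The UTF-8 bytes of "áéú", written as hex escapes so the example does not
// depend on the encoding of the source file.
const std::string kInput = "\xC3\xA1\xC3\xA9\xC3\xBA";
```

Here kInput.size() is the byte count that a byte-by-byte conversion would report, and count_code_points(kInput) is the length expected after a proper UTF-8 decode.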

Live demo: one two (I can't reproduce the other behaviour with online compilers).

Is this standard-conforming? Shouldn't I be able to leave the global locale alone and use imbue instead?

Should either of the described behaviours be classified as an implementation bug?


Solution

  • First of all, you should use wcout with wcin.

    With that in place, there are two possible solutions:

    1) Disable synchronization between the iostream and cstdio streams by calling

       ios_base::sync_with_stdio(false);
    

    Note that this should be the first call: if any I/O has already been performed on the standard streams, the effect is implementation-defined.

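    // Same includes and using-directive as in the question.
    int main() {
        // Must come before any I/O on the standard streams.
        ios_base::sync_with_stdio(false);
        // Imbue the UTF-8 locale into wcin only; the global locale is untouched.
        wcin.imbue(locale("C.UTF-8"));

        wstring s;
        wcin >> s;
        wcout << s.length() << " " << (s == L"áéú");
        return 0;
    }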
    

    2) Set the C locale with std::setlocale and imbue the same locale into wcout:

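    // Same includes as in the question, plus <clocale> for std::setlocale.
    int main() {
        // Set the C library locale, which the synchronized streams rely on.
        std::setlocale(LC_ALL, "C.UTF-8");
        wcout.imbue(locale("C.UTF-8"));

        wstring s;
        wcin >> s;
        wcout << s.length() << " " << (s == L"áéú");
        return 0;
    }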
    

    I tested both of them on ideone and they work fine. I don't have clang++/libc++ available, so I wasn't able to test the behavior there.
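
One practical caveat with both variants: locale("C.UTF-8") throws std::runtime_error when the system has no locale with that name, and not every system provides one. A minimal sketch of a fallback is shown below. The helper name pick_utf8_locale and the second candidate "en_US.UTF-8" are assumptions for illustration, not something the snippets above rely on.

```cpp
#include <initializer_list>
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns the first locale from a candidate list that
// can be constructed on this system. The candidate names are assumptions;
// adjust them to whatever `locale -a` reports on the target machine.
std::locale pick_utf8_locale() {
    for (const char* name : {"C.UTF-8", "en_US.UTF-8"}) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // This name is not available here; try the next candidate.
        }
    }
    // No candidate is available: fall back to the classic "C" locale.
    // Note that this fallback will NOT decode UTF-8.
    return std::locale::classic();
}
```

It could then be used as wcin.imbue(pick_utf8_locale()); in place of the hard-coded locale name. If the result's name() is "C", no UTF-8 locale was found and the input will not be transcoded.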