I am trying to help a friend with a project that was supposed to take an hour and has now dragged on for three days. Needless to say I feel very frustrated and angry ;-) ooooouuuu... I breathe.
So the program, written in C++, just reads a bunch of files and processes them. The problem is that the files use UTF-16 encoding (because they contain words written in different languages), and a simple use of ifstream just doesn't seem to work (it reads and outputs garbage). It took me a while to realise that this was because the files were in UTF-16.
Now I have spent literally the whole afternoon on the web trying to find info about READING UTF-16 files and converting the content of a UTF-16 line to char! I just can't seem to! It's a nightmare. I tried to learn about <locale> and <codecvt>, wstring, etc., which I have never used before (I am specialised in graphics apps, not desktop apps). I just can't get it.
This is what I have done so far (but it doesn't work):
std::wifstream file2(fileFullPath);
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);
while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}
That's the best I could come up with, but it doesn't work, and it doesn't do anything better than before. The real problem is that I don't understand what I am doing in the first place anyway.
SO PLEASE PLEASE HELP! It is really driving me crazy that I can't even read a G*** D*** text file.
On top of that, my friend uses Ubuntu (I use clang++) and this code needs -stdlib=libc++, which doesn't seem to be supported by gcc on his side (even though he uses a pretty recent version of gcc, 4.6.3 I believe). So I am not even sure using codecvt and locale is a good idea (as in "possible"). Would there be a better (or another) option?
If I convert all the files to utf-8 just from the command line (using a linux command) am I going to potentially lose information?
Thanks a lot; I will be forever grateful to you if you help me on this.
If I convert all the files to utf-8 just from the command line (using a linux command) am I going to potentially lose information?
No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.
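As a sketch of that command-line route (assuming iconv is available, which it is by default on Ubuntu; the file names here are placeholders), a round trip through UTF-16 and back also demonstrates that nothing is lost:

```shell
# Convert a UTF-16 file (BOM detected automatically) to UTF-8.
# 'input.txt' and 'output.txt' are placeholder names.
iconv -f UTF-16 -t UTF-8 input.txt > output.txt

# Quick lossless round-trip check: UTF-8 -> UTF-16 -> UTF-8
printf 'hello world' | iconv -f UTF-8 -t UTF-16 | iconv -f UTF-16 -t UTF-8
```

If the exact flavour matters (e.g. Windows files without a BOM), you can name it explicitly with -f UTF-16LE or -f UTF-16BE.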
When wide characters were introduced they were intended to be a text representation used exclusively internal to a program, and never written to disk as wide characters. The wide streams reflect this by converting the wide characters you write out to narrow characters in the output file, and converting narrow characters in a file to wide characters in memory when reading.
std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).
std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ASCII in the file is converted to wide characters.
Of course the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use that facet to convert from wchar_t to char when writing, and to convert from char to wchar_t when reading.
However, since some people started writing files out in UTF-16, other people have just had to deal with it. The way they do that with C++ streams is by creating codecvt facets that will treat char as holding half a UTF-16 code unit, which is what codecvt_utf16 does.
So with that explanation, here are the problems with your code:
std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}
Here's one way to rewrite the above:
// headers needed by this snippet
#include <codecvt>   // std::codecvt_utf16
#include <cwchar>    // WCHAR_MAX
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);
// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");
// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);
// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));
// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work.
std::wstring line;
while (std::getline(file2, line)) {
    std::wcout << line << std::endl;
}