Tags: c++, string, unicode, utf-8, character-encoding

Issue when converting utf16 wide std::wstring to utf8 narrow std::string for rare characters


Why do some utf16 encoded wide strings, when converted to utf8 encoded narrow strings using this commonly found conversion function, produce hex values that don't appear to be correct?

std::string convert_string(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(str);
}

Hello. I have a C++ app on Windows which takes some user input on the command line. I'm using the wide-character main entry point to get the input as a utf16 string, which I'm converting to a utf8 narrow string using the above function.

This function can be found in many places online and works in almost all cases. I have, however, found a few examples where it doesn't seem to work as expected.

For example, if I input an emoji character "🤢" as a string literal (in my utf8 encoded cpp file) and write it to disk, the file (FILE-1) contains the following data (which are the correct utf8 hex values specified here https://www.fileformat.info/info/unicode/char/1f922/index.htm):

    0xF0 0x9F 0xA4 0xA2

However, if I pass the emoji to my application on the command line, convert it to a utf8 string using the conversion function above, and then write it to disk, the file (FILE-2) contains different raw bytes:

    0xED 0xA0 0xBE 0xED 0xB4 0xA2

While the second file seems to indicate the conversion has produced the wrong output, if you copy and paste the hex values (in Notepad++ at least) it produces the correct emoji. Also, WinMerge considers the two files to be identical.

So to conclude, I would really like to know the following:

  1. How the incorrect-looking converted hex values map correctly to the right utf8 character in the example above
  2. Why the conversion function converts some characters to this form while almost all other characters produce the expected raw bytes
  3. As a bonus, whether it is possible to modify the conversion function to stop it from outputting these rare characters in this form

I should note that I already have a workaround function below which uses WinAPI calls, however using standard library calls only is the dream :)

std::string convert_string(const std::wstring& wstr)
{
    if(wstr.empty())
        return std::string();

    // First call: ask WideCharToMultiByte how many UTF-8 bytes are needed (no output buffer).
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    // Second call: perform the actual UTF-16 -> UTF-8 conversion into the sized buffer.
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
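
A minimal sketch of how the workaround is wired up, assuming a wmain entry point and that the first command-line argument carries the emoji (the argument index and output file name here are illustrative, not taken from the original question):

#include <windows.h>

#include <fstream>
#include <string>

std::string convert_string(const std::wstring& wstr); // the WinAPI workaround above

// Hypothetical wide entry point: argv[1] is assumed to hold the text typed on the command line.
int wmain(int argc, wchar_t* argv[])
{
    if (argc < 2)
        return 1;

    std::wstring input = argv[1];               // utf16 text from the command line
    std::string utf8 = convert_string(input);   // convert with the workaround above

    std::ofstream out("out.bin", std::ios::binary); // write raw bytes for inspection in a hex editor
    out << utf8;
    return 0;
}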

Solution

  • The problem is that std::wstring_convert<std::codecvt_utf8<wchar_t>> converts from UCS-2, not from UTF-16. Characters inside the BMP (U+0000..U+FFFF) have identical encodings in UCS-2 and UTF-16 and so will work, but characters outside the BMP (U+10000..U+10FFFF), such as your emoji, do not exist in UCS-2 at all. This means the conversion doesn't understand the character and produces incorrect UTF-8 bytes: it has encoded each half of the UTF-16 surrogate pair as if it were a separate character.

    You need to use std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> instead.
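
    For reference, a minimal sketch of the fixed standard-library version, keeping the same signature as the original (note that std::wstring_convert and the codecvt facets are deprecated as of C++17, though still available):

#include <codecvt>
#include <locale>
#include <string>

std::string convert_string(const std::wstring& str)
{
    // codecvt_utf8_utf16 treats the wide input as UTF-16, so surrogate pairs
    // are combined into a single code point before being encoded as UTF-8.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(str);
}

    As for question 1, the six bytes in FILE-2 are exactly what you get if each half of the surrogate pair for U+1F922 (0xD83E and 0xDD22) is encoded on its own as a three-byte UTF-8 sequence (sometimes called CESU-8). The sketch below reproduces that arithmetic by hand; the printed values are a worked check, not output captured from the asker's program:

#include <cstdint>
#include <cstdio>

// Encode a 16-bit value as a 3-byte UTF-8 sequence, which is what the UCS-2
// facet does to each surrogate half instead of combining the pair.
static void print_3byte_utf8(std::uint16_t cp)
{
    std::printf("0x%02X 0x%02X 0x%02X\n",
                0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F));
}

int main()
{
    const std::uint32_t emoji = 0x1F922;                        // 🤢
    std::uint16_t high = 0xD800 + ((emoji - 0x10000) >> 10);    // 0xD83E
    std::uint16_t low  = 0xDC00 + ((emoji - 0x10000) & 0x3FF);  // 0xDD22

    print_3byte_utf8(high);  // prints 0xED 0xA0 0xBE
    print_3byte_utf8(low);   // prints 0xED 0xB4 0xA2
}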