c++ unicode utf-8 icu

Unexpected result when converting Unicode to UTF-8 with ICU


I am trying to convert Hebrew characters to UTF-8 and store them in a std::string.

I thought I'd give ICU a try.

Here is my minimal example:

#include <cstdlib>   // EXIT_SUCCESS
#include <string>
#include <unicode/ustream.h>

int main(int argc, char** argv)
{
   icu::UnicodeString us = L"עידן";
   std::string s;
   us.toUTF8String(s);   // appends the UTF-8 encoding of us to s
   return EXIT_SUCCESS;
}

I would expect s to display the same "characters" as the UnicodeString (although in UTF-8).

However, when I look at s in the debugger, its value is shown as ×¢×™×“×Ÿ.

I am clearly misunderstanding something about how to convert these Unicode characters so that they fit in, and can be understood as, a sequence of char.

What am I doing wrong here?


Solution

  • The Unicode string עידן encoded in UTF-8 is the byte sequence D7 A2 D7 99 D7 93 D7 9F. Those same bytes interpreted in a Latin-based single-byte encoding, such as Windows-1252, are the characters ×¢×™×“×Ÿ.

    So the actual bytes stored in the std::string are correct (you can verify this by printing out the numeric values of the individual chars in the string). By default, the debugger doesn't know that the string is encoded in UTF-8, so it displays the raw data using the wrong encoding.

    Depending on which debugger you are using, you might be able to tell it that the string data is UTF-8 so that it displays correctly. For example, Visual Studio supports format specifiers in its debugger; in this case, the s8 and s8b specifiers handle UTF-8 strings.
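
To check the bytes without relying on the debugger's display, a minimal sketch along these lines may help. The helper name hex_bytes is hypothetical (it is not part of ICU or the standard library); it only formats each byte of a std::string as two hex digits, so it can be applied to whatever toUTF8String produced.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not an ICU function): returns the bytes of `s`
// as uppercase two-digit hex values separated by single spaces.
std::string hex_bytes(const std::string& s)
{
    std::string out;
    char buf[3];  // two hex digits plus the terminating NUL
    // Iterate as unsigned char so bytes >= 0x80 are not sign-extended.
    for (unsigned char byte : s) {
        if (!out.empty()) {
            out += ' ';
        }
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(byte));
        out += buf;
    }
    return out;
}
```

If s holds the UTF-8 encoding of עידן, hex_bytes(s) should return D7 A2 D7 99 D7 93 D7 9F, which confirms that the conversion worked and only the debugger's interpretation is off.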