I am trying to convert Hebrew characters into UTF-8 and store them in a std::string.
I thought I'd give ICU a try.
Here is my minimal example:
#include <cstdlib>   // for EXIT_SUCCESS
#include <string>
#include <unicode/ustream.h>

int main(int argc, char** argv)
{
    icu::UnicodeString us = L"עידן";
    std::string s;
    us.toUTF8String(s);
    return EXIT_SUCCESS;
}
I would expect s to display the same "characters" as the UnicodeString (although in UTF-8). However, when looking at s in the debugger, the value of the string is: ×¢×™×“×Ÿ.
I am clearly misunderstanding something about how to properly convert these Unicode characters so that they fit in, and are correctly interpreted as, char. What am I doing wrong here?
The Unicode string עידן encoded in UTF-8 is the byte sequence D7 A2 D7 99 D7 93 D7 9F. Those same bytes interpreted in a Latin-based encoding, such as Windows-1252, are the characters ×¢×™×“×Ÿ.
So, the actual bytes stored in the std::string are correct (and you can verify this by simply printing out the numeric values of the individual chars in the string, as in the sketch below). The debugger simply doesn't know the string is encoded in UTF-8, so by default it displays the raw data using the wrong encoding.
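Here is a minimal sketch of that check, assuming ICU 59 or newer (where UChar is char16_t, so a u"..." literal can initialize a UnicodeString directly); it prints each byte of s in hex and should output d7 a2 d7 99 d7 93 d7 9f:

#include <cstdio>
#include <string>
#include <unicode/unistr.h>

int main()
{
    // Build the same string as in the question, then convert to UTF-8.
    icu::UnicodeString us = u"עידן";
    std::string s;
    us.toUTF8String(s);

    // Print the numeric value of each char in the UTF-8 string.
    // Expected: d7 a2 d7 99 d7 93 d7 9f
    for (unsigned char c : s)
        std::printf("%02x ", static_cast<unsigned>(c));
    std::printf("\n");
    return 0;
}

(Compile and link against ICU's common library, typically with -licuuc.)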
Depending on which debugger you are using, you might be able to tell it the string data is UTF-8 so it displays correctly. For example, Visual Studio has format specifiers for its debugger; the s8 and s8b specifiers handle UTF-8 strings.
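If I recall the Visual Studio syntax correctly, you append the specifier to the expression in a Watch window, e.g. s,s8, and the debugger will then decode the buffer as UTF-8 and display the Hebrew text as expected.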