I have the following code, which is three sets of functions for converting UTF-8 to UTF-16 and vice versa, each using a different technique.
However, all of them fail:
std::ostream& operator << (std::ostream& os, const std::string &data)
{
    SetConsoleOutputCP(CP_UTF8);
    DWORD slen = data.size();
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), data.size(), &slen, nullptr);
    return os;
}

std::wostream& operator <<(std::wostream& os, const std::wstring &data)
{
    DWORD slen = data.size();
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), slen, &slen, nullptr);
    return os;
}

std::wstring AUTF8ToUTF16(const std::string &data)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(data);
}

std::string AUTF16ToUTF8(const std::wstring &data)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(data);
}

std::wstring BUTF8ToUTF16(const std::string& utf8)
{
    std::wstring utf16;
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    if (len > 1)
    {
        utf16.resize(len - 1);
        wchar_t* ptr = &utf16[0];
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, ptr, len);
    }
    return utf16;
}

std::string BUTF16ToUTF8(const std::wstring& utf16)
{
    std::string utf8;
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, NULL, 0, 0, 0);
    if (len > 1)
    {
        utf8.resize(len - 1);
        char* ptr = &utf8[0];
        WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, ptr, len, 0, 0);
    }
    return utf8;
}

std::string CUTF16ToUTF8(const std::wstring &data)
{
    std::string result;
    result.resize(std::wcstombs(nullptr, &data[0], data.size()));
    std::wcstombs(&result[0], &data[0], data.size());
    return result;
}

std::wstring CUTF8ToUTF16(const std::string &data)
{
    std::wstring result;
    result.resize(std::mbstowcs(nullptr, &data[0], data.size()));
    std::mbstowcs(&result[0], &data[0], data.size());
    return result;
}

int main()
{
    std::string str = "консоли";
    MessageBoxA(nullptr, str.c_str(), str.c_str(), 0); //Works Fine!
    std::wstring wstr = AUTF8ToUTF16(str); //Crash!
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Crash + Display nothing..
    wstr = BUTF8ToUTF16(str);
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Random chars..
    wstr = CUTF8ToUTF16(str);
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Question marks..
    std::cin.get();
}
The only thing that works above is the MessageBoxA call. I don't understand why, because I'm told that Windows converts everything to UTF-16 anyway, so why can't I convert it myself?
Why do none of my conversions work?
Is there a reason my code does not work?
The root reason all of your approaches fail is that they require the std::string to be UTF-8 encoded, but std::string str = "консоли" is not UTF-8 encoded unless you save the .cpp file as UTF-8 and configure your compiler's default codepage to UTF-8. In most C++11 compilers, you can use the u8 prefix to force the string literal to be UTF-8:
std::string str = u8"консоли";
However, VS 2013 does not support that feature yet:
Feature                  VS 2010  VS 2012  VS 2013
Unicode string literals  No       No       No
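Where the u8 prefix is unavailable, one workaround is to spell the UTF-8 bytes out with hex escapes, so the literal no longer depends on the encoding of the source file or the compiler's codepage. The sketch below does that for "консоли" and adds a small well-formedness check. The names is_valid_utf8, kUtf8 and kCp1251 are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects overlong forms, surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;      // continuation bytes expected after the lead
        unsigned long cp = 0;       // code point decoded so far
        unsigned long min_cp = 0;   // smallest code point allowed for this length

        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;          // stray continuation byte or invalid lead byte

        if (s.size() - i - 1 < extra) return false;   // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j) {
            const unsigned char c = static_cast<unsigned char>(s[i + j]);
            if ((c & 0xC0) != 0x80) return false;     // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

// "консоли" as explicit UTF-8 bytes (two bytes per Cyrillic letter).
const std::string kUtf8 = "\xD0\xBA\xD0\xBE\xD0\xBD\xD1\x81\xD0\xBE\xD0\xBB\xD0\xB8";
// The same word as Windows-1251 bytes (one byte per letter), for comparison.
const std::string kCp1251 = "\xEA\xEE\xED\xF1\xEE\xEB\xE8";
```

Running is_valid_utf8() on str before converting is a quick way to tell whether the compiler actually stored the literal as UTF-8: the 14-byte form passes, the 7-byte Windows-1251 form does not.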
Windows itself does not support UTF-8 in most API functions that take a char* as input (an exception is MultiByteToWideChar() when using CP_UTF8). When you call an A function, it calls the corresponding W function internally, converting any char* data to/from UTF-16 using Windows' default codepage (CP_ACP). So you get random results when you pass non-CP_ACP data to functions that are expecting it. As such, MessageBoxA() will work correctly only if your .cpp file and compiler are using the same codepage as CP_ACP, so that the unprefixed char* data matches what MessageBoxA() is expecting.
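To make the codepage mismatch concrete, the sketch below decodes bytes the way a single-byte codepage would. It assumes CP_ACP is Windows-1251, which is only an assumption for this example, since the real default codepage depends on system settings. decode_cp1251_subset is a hypothetical helper that covers only ASCII and the letters А to я.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified Windows-1251 decoder: handles ASCII and the
// letters at 0xC0..0xFF (U+0410..U+044F). Every other byte becomes U+FFFD,
// which is NOT what the real codepage does; it only keeps the sketch short.
std::u32string decode_cp1251_subset(const std::string& bytes)
{
    std::u32string out;
    for (char ch : bytes) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80)
            out.push_back(static_cast<char32_t>(b));
        else if (b >= 0xC0)
            out.push_back(static_cast<char32_t>(0x0410 + (b - 0xC0)));
        else
            out.push_back(static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

Given the 7 Windows-1251 bytes of "консоли", this produces the 7 intended letters. Given the 14 UTF-8 bytes of the same word, it produces 14 code points that do not spell the word, which is the kind of garbage an A function shows when its input is not in the codepage it assumes.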
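I don't know for certain why AUTF8ToUTF16() is crashing. std::wstring_convert::from_bytes() throws std::range_error when a conversion fails and no error string was given to the constructor, so an uncaught exception on the non-UTF-8 input is a likely cause. Otherwise it is probably a bug in your compiler's STL implementation when processing bad data.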
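Since from_bytes() reports a failed conversion by throwing std::range_error (unless an error string was supplied to the constructor), wrapping the call in a try block shows whether that is what happens here. This sketch swaps in std::codecvt_utf8_utf16<wchar_t> for the std::codecvt_utf8<wchar_t> used in the question, because the latter converts to UCS-2 rather than UTF-16 when wchar_t is 16 bits wide. try_utf8_to_utf16 is a hypothetical name, and note that these facilities are deprecated since C++17.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-16 code units and reports
// failure instead of letting std::range_error escape to the caller.
bool try_utf8_to_utf16(const std::string& utf8, std::wstring& out)
{
    try {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
        out = conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        out.clear();
        return false;   // the input could not be decoded as UTF-8
    }
}
```

If this returns false for str, the string was not valid UTF-8 to begin with, and the crash is the uncaught exception rather than a library bug.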
BUTF8ToUTF16() is not handling this case described in the documentation: "If the input byte/char sequences are invalid, returns U+FFFD for UTF encodings." Also, your implementation is not optimal. Pass length() instead of -1 as the input size to avoid dealing with null terminator issues.
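The two suggestions above could be combined along these lines. This is a sketch rather than a drop-in replacement: Utf8ToUtf16 is a hypothetical name, and the MB_ERR_INVALID_CHARS flag is an added choice that makes MultiByteToWideChar() fail on malformed input instead of substituting U+FFFD. The code is Windows-only, so it is guarded by _WIN32.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <stdexcept>

// Hypothetical helper: UTF-8 to UTF-16 using the explicit byte length,
// so no null terminator is counted or written by the API.
// Assumes the input is small enough for its size to fit in an int.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();   // a source length of 0 makes the API call fail

    const int srcLen = static_cast<int>(utf8.size());

    // First call: ask how many wchar_t units the result needs.
    const int dstLen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), srcLen, nullptr, 0);
    if (dstLen == 0)
        throw std::runtime_error("conversion failed (for example, invalid UTF-8)");

    // Second call: convert into a buffer of exactly that size.
    std::wstring utf16(static_cast<std::size_t>(dstLen), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), srcLen, &utf16[0], dstLen);
    return utf16;
}
#endif
```

Throwing on failure is one design choice. Passing 0 for the flags instead keeps the documented U+FFFD substitution, in which case the caller has to scan the result for U+FFFD to detect bad input.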
CUTF8ToUTF16() is not doing any error handling or validation. However, converting invalid input to question marks or U+FFFD is very common in most libraries.
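For reference, std::mbstowcs() signals an invalid sequence by returning static_cast<std::size_t>(-1), and it decodes bytes according to the LC_CTYPE category of the current C locale rather than assuming UTF-8. A sketch with that check added follows; the name LocaleToWide is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: converts a multibyte string in the current C locale's
// encoding to wide characters. Returns false on an invalid byte sequence.
bool LocaleToWide(const std::string& data, std::wstring& out)
{
    // With a null destination, mbstowcs only counts the wide characters
    // needed (behavior documented by POSIX and by the Microsoft CRT).
    const std::size_t needed = std::mbstowcs(nullptr, data.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) {
        out.clear();
        return false;   // bytes are not valid in the current locale's encoding
    }
    out.resize(needed);
    if (needed > 0)
        std::mbstowcs(&out[0], data.c_str(), needed);
    return true;
}
```

Calling std::setlocale(LC_ALL, "") beforehand selects the user's default locale. Whether that locale decodes UTF-8 depends on the platform and its configuration, which is why this approach is a poor fit for UTF-8 data on Windows.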