As this question is some years old Is C++20 'char8_t' the same as our old 'char'?
I would like to know, what is the recommended way to handle the char8_t and char conversion right now? boost::nowide (1.80.0) doesn´t not yet understand char8_t nor (AFAIK) boost::locale.
As Tom Honermann noted that
reinterpret_cast<const char *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text"); // Undefined behavior.
So: How do i interact with APIs that just accept const char*
or const wchar_t*
(think Win32 API) if my application "default" string type is std::u8string? The recommendation seems to be https://utf8everywhere.org/.
If i got a std::u8string and convert to std::string by
std::u8string convert(std::string str)
{
return std::u8string(reinterpret_cast<const char8_t*>(str.data()), str.size());
}
std::string convert(std::u8string str)
{
return std::string(reinterpret_cast<const char_t*>(str.data()), str.size());
}
This would invoke the same UB that Tom Honermann mentioned. This would be used when i talk to Win32 API or any other API that wants some const char*
or gives some const char*
back. I could go all conversions through boost::nowide but in the end i get a const char*
back from boost::nowide::narrow() that i need to cast.
Is the current recommendation to just stay at char and ignore char8_t?
This would invoke the same UB that Tom Honermann mentioned.
As pointed out in the post you referred to, UB only happens when you cast from a char*
to a char8_t*
. The other direction is fine.
If you are given a char*
which is encoded in UTF-8 (and you care to avoid the UB of just doing the cast for some reason), you can use std::transform
to convert the char
s to char8_t
s by converting the characters:
std::u8string convert(std::string str)
{
std::u8string ret(str.size());
std::ranges::transform(str, ret.begin(), [](char c) {return char8_t(c);});
return ret;
}
C++23's ranges::to
will make using a named return variable unnecessary.
For dealing with wchar_t
interfaces (which you shouldn't have to, since nowadays UTF-8 support exists through narrow character interfaces on Windows), you'll have to do an actual UTF-8->UTF-16 conversion. Which you would have had to do anyway.