How to convert back and forth between a Unicode/UCS codepoint and a UTF16 surrogate pair in C++14 and later?
EDIT: Removed mention of UCS-2 surrogates, as there is no such thing. Thanks @remy-lebeau!
In C++11 and later, you can use std::wstring_convert to convert between various UTF/UCS encodings, using the following std::codecvt types (note that std::wstring_convert and the <codecvt> facets are deprecated as of C++17, though still available):
UTF-8 <-> UCS-2: std::codecvt_utf8<char16_t>
UTF-8 <-> UTF-16: std::codecvt_utf8_utf16<char16_t> (sketched below)
UTF-8 <-> UTF-32/UCS-4: std::codecvt_utf8<char32_t>
UCS-2 <-> UTF-16: std::codecvt_utf16<char16_t>
UTF-16 <-> UTF-32/UCS-4: std::codecvt_utf16<char32_t>
UCS-2 <-> UTF-32/UCS-4: no standard facet, but you can write your own std::codecvt class for it if needed. Otherwise, chain two of the conversions above: UCS-2 <-> UTF-X <-> UTF-32/UCS-4
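For instance, here is a minimal sketch of the UTF-8 <-> UTF-16 pairing (the function names UTF8toUTF16 and UTF16toUTF8 are just illustrative, not anything standard):

#include <codecvt>
#include <locale>
#include <string>

// UTF-8 byte strings <-> UTF-16 char16_t strings; the facet takes care
// of multi-byte sequences and surrogate pairs.
using convert_utf8_utf16 = std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>;

std::u16string UTF8toUTF16(const std::string &utf8)
{
    return convert_utf8_utf16{}.from_bytes(utf8);
}

std::string UTF16toUTF8(const std::u16string &utf16)
{
    return convert_utf8_utf16{}.to_bytes(utf16);
}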
You don't need to handle surrogates manually; the facet encodes and decodes surrogate pairs for you. You can use std::u32string to hold your codepoint(s), and std::u16string to hold your UTF-16/UCS-2 code units. For example:
#include <codecvt>
#include <locale>
#include <string>

// std::codecvt_utf16<char32_t> converts between a UTF-16 *byte* stream and
// UTF-32. The std::little_endian flag makes that byte stream match the
// in-memory layout of char16_t on little-endian platforms (drop the flag on
// big-endian ones), so the bytes can be reinterpreted as char16_t code units.
using convert_utf16_utf32 = std::wstring_convert<
    std::codecvt_utf16<char32_t, 0x10FFFF, std::little_endian>, char32_t>;

std::u16string CodepointToUTF16(const char32_t codepoint)
{
    // to_bytes() emits the UTF-16 code units (surrogate pairs included) as
    // raw bytes; reinterpret them as char16_t.
    const std::string bytes = convert_utf16_utf32{}.to_bytes(codepoint);
    return std::u16string(
        reinterpret_cast<const char16_t*>(bytes.data()),
        bytes.size() / sizeof(char16_t)
    );
}

std::u16string UTF32toUTF16(const std::u32string &str)
{
    const std::string bytes = convert_utf16_utf32{}.to_bytes(str);
    return std::u16string(
        reinterpret_cast<const char16_t*>(bytes.data()),
        bytes.size() / sizeof(char16_t)
    );
}

char32_t UTF16toCodepoint(const std::u16string &str)
{
    // from_bytes() reads the char16_t data as a UTF-16 byte stream and
    // decodes it back to codepoints.
    const char *first = reinterpret_cast<const char*>(str.data());
    const char *last = reinterpret_cast<const char*>(str.data() + str.size());
    const std::u32string utf32 = convert_utf16_utf32{}.from_bytes(first, last);
    return utf32.empty() ? char32_t{0} : utf32.front();
}

std::u32string UTF16toUTF32(const std::u16string &str)
{
    const char *first = reinterpret_cast<const char*>(str.data());
    const char *last = reinterpret_cast<const char*>(str.data() + str.size());
    return convert_utf16_utf32{}.from_bytes(first, last);
}
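A quick sanity check of the functions above (the expected values are simply the UTF-16 encoding of U+1F600, which lies outside the BMP):

#include <cassert>

int main()
{
    // U+1F600 (GRINNING FACE) is outside the BMP, so it must become a surrogate pair.
    const std::u16string pair = CodepointToUTF16(U'\U0001F600');
    assert(pair.size() == 2);
    assert(pair[0] == 0xD83D); // high (lead) surrogate
    assert(pair[1] == 0xDE00); // low (trail) surrogate

    // Round-trip back to the original codepoint.
    assert(UTF16toCodepoint(pair) == U'\U0001F600');
}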
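For reference, the arithmetic the facet performs for you is small; if you ever do need to handle surrogates by hand, a sketch (helper names are mine, and there is no error checking) looks like this:

#include <string>

// Encode one codepoint as UTF-16 by hand (no validation of the input).
std::u16string EncodeUTF16(char32_t cp)
{
    if (cp < 0x10000) // BMP codepoints are a single code unit
        return std::u16string(1, static_cast<char16_t>(cp));
    cp -= 0x10000;    // the remaining 20 bits are split across the pair
    const char16_t high = static_cast<char16_t>(0xD800 + (cp >> 10));
    const char16_t low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    return std::u16string{high, low};
}

// Decode a high/low surrogate pair back into a codepoint.
char32_t DecodeSurrogatePair(char16_t high, char16_t low)
{
    return 0x10000 + ((char32_t(high) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
}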