How to convert between a Unicode/UCS codepoint and a UTF16 surrogate pair?


How can I convert back and forth between a Unicode/UCS code point and a UTF-16 surrogate pair in C++14 and later?

EDIT: Removed mention of UCS-2 surrogates, as there is no such thing. Thanks @remy-lebeau!


Solution

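  • In C++11 and later, you can use std::wstring_convert to convert between various UTF/UCS encodings, using the std::codecvt facet types declared in <codecvt>, such as std::codecvt_utf16 (used in the example below). Note that std::wstring_convert and the <codecvt> facets are deprecated as of C++17, although they are still available in C++14.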

    You don't need to handle surrogates manually.

    You can use std::u32string to hold your code point(s), and std::u16string to hold your UTF-16/UCS-2 code units.

    For example:

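    #include <codecvt>  // std::codecvt_utf16 (deprecated since C++17)
    #include <locale>   // std::wstring_convert (deprecated since C++17)
    #include <string>

    // Converts between UTF-32 code points (char32_t) and a UTF-16 *byte* sequence.
    // The wide character type must match the facet's internal type (char32_t).
    // Without std::little_endian, the byte sequence is big-endian and has no BOM.
    using convert_utf16_utf32 = std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t>;

    // Packs big-endian UTF-16 bytes into char16_t code units, independent of host endianness.
    static std::u16string BytesToUTF16(const std::string &bytes)
    {
        std::u16string out;
        for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
            out += static_cast<char16_t>((static_cast<unsigned char>(bytes[i]) << 8) |
                                          static_cast<unsigned char>(bytes[i + 1]));
        return out;
    }

    // Unpacks char16_t code units into big-endian UTF-16 bytes.
    static std::string UTF16ToBytes(const std::u16string &str)
    {
        std::string out;
        for (char16_t cu : str) {
            out += static_cast<char>(cu >> 8);
            out += static_cast<char>(cu & 0xFF);
        }
        return out;
    }

    std::u16string CodepointToUTF16(const char32_t codepoint)
    {
        return BytesToUTF16(convert_utf16_utf32{}.to_bytes(codepoint));
    }

    std::u16string UTF32toUTF16(const std::u32string &str)
    {
        return BytesToUTF16(convert_utf16_utf32{}.to_bytes(str));
    }

    // Returns the first code point in str, or U+0000 if str is empty.
    // from_bytes() throws std::range_error on malformed input (e.g. a lone surrogate).
    char32_t UTF16toCodepoint(const std::u16string &str)
    {
        const std::u32string codepoints = convert_utf16_utf32{}.from_bytes(UTF16ToBytes(str));
        return codepoints.empty() ? U'\0' : codepoints.front();
    }

    std::u32string UTF16toUTF32(const std::u16string &str)
    {
        return convert_utf16_utf32{}.from_bytes(UTF16ToBytes(str));
    }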
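
For reference, the surrogate arithmetic that such a conversion performs can also be written by hand, which avoids any dependency on std::wstring_convert. The following is only a minimal sketch: the names IsHighSurrogate, IsLowSurrogate, EncodeUTF16 and DecodeSurrogatePair are hypothetical helpers invented for illustration, not standard library functions, and throwing std::invalid_argument on bad input is an assumed error-handling choice.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper names, chosen for this sketch only.

// True if cu is a high (leading) surrogate, U+D800..U+DBFF.
constexpr bool IsHighSurrogate(char16_t cu) { return cu >= 0xD800 && cu <= 0xDBFF; }

// True if cu is a low (trailing) surrogate, U+DC00..U+DFFF.
constexpr bool IsLowSurrogate(char16_t cu) { return cu >= 0xDC00 && cu <= 0xDFFF; }

// Encodes one code point as one or two UTF-16 code units.
std::u16string EncodeUTF16(char32_t cp)
{
    // Surrogate values and anything above U+10FFFF are not valid code points to encode.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    // Code points in the Basic Multilingual Plane fit in a single code unit.
    if (cp < 0x10000)
        return std::u16string(1, static_cast<char16_t>(cp));

    // Supplementary planes: subtract 0x10000 to get a 20-bit value,
    // then split it into the top 10 bits and the bottom 10 bits.
    const char32_t v = cp - 0x10000;
    std::u16string out;
    out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
    out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    return out;
}

// Combines a high/low surrogate pair back into a single code point.
char32_t DecodeSurrogatePair(char16_t high, char16_t low)
{
    if (!IsHighSurrogate(high) || !IsLowSurrogate(low))
        throw std::invalid_argument("not a valid surrogate pair");

    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

With these helpers, EncodeUTF16(0x1F600) yields the two code units 0xD83D 0xDE00, and DecodeSurrogatePair(0xD83D, 0xDE00) gives back 0x1F600.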