c++11, utf-16, utf-32, codecvt

How do you use wstring_convert to convert between utf16 and utf32?


When you are going from std::u16string to, let's say, std::u32string, std::wstring_convert doesn't work directly because from_bytes expects char input. So how does one use std::wstring_convert to convert between UTF-16 and UTF-32 with a std::u16string as input?

For example:

inline std::u32string utf16_to_utf32(const std::u16string& s) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.from_bytes(s); // cannot do this, expects 'char'
}

Is it ok to reinterpret_cast to char, as I've seen in a few examples?

If you do need to reinterpret_cast, I've seen some examples using the string size as opposed to the total byte size for the pointers. Is that an error or a requirement?

I know codecvt is deprecated, but until the standard offers an alternative, it has to do.


Solution

  • If you do not want to reinterpret_cast, the only way I've found is to convert to UTF-8 first, then convert that to UTF-32.

    For example:

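    // Requires <codecvt>, <locale>, and <string>.

    // Convert UTF-16 to UTF-8.
    std::u16string s;
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
    std::string utf8_str = conv16.to_bytes(s);
    
    // Convert UTF-8 to UTF-32. This needs its own converter object with a
    // distinct name; declaring `conv` twice in one scope does not compile.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
    std::u32string utf32_str = conv32.from_bytes(utf8_str);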
    

    Yes, this is sad, and it likely contributed to the deprecation of codecvt.
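
For the reinterpret_cast route the question asks about, here is a minimal sketch rather than a definitive implementation. It assumes your standard library still ships <codecvt>. host_is_little_endian is a hypothetical helper introduced only for this example. Two details matter. First, from_bytes(first, last) takes char pointers, so the end pointer has to be computed from the byte count (size() * sizeof(char16_t)); using the string size alone would cover only half of the data. Second, std::codecvt_utf16 reads big-endian bytes by default, so on a little-endian host the std::little_endian mode is needed to match the in-memory layout of a std::u16string.

```cpp
#include <codecvt>
#include <cstdint>
#include <locale>
#include <string>

// Hypothetical helper (not part of any library): true when the host
// stores the low-order byte of a 16-bit value first.
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}

inline std::u32string utf16_to_utf32(const std::u16string& s) {
    // Viewing the code units as raw bytes; reading any object through
    // a char pointer is permitted.
    const char* first = reinterpret_cast<const char*>(s.data());
    // The range is measured in bytes, not in char16_t code units.
    const char* last = first + s.size() * sizeof(char16_t);

    if (host_is_little_endian()) {
        // Match the little-endian in-memory layout of the u16string.
        std::wstring_convert<
            std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>,
            char32_t> conv;
        return conv.from_bytes(first, last);
    }
    // The default mode of codecvt_utf16 is big-endian.
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.from_bytes(first, last);
}
```

With this sketch, utf16_to_utf32(u"A") should compare equal to U"A", and a surrogate pair in the input should come back as a single char32_t. Malformed input (for example, a lone surrogate) makes from_bytes throw std::range_error, because no error string was passed to the converter.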