c++ c++11 unicode utf-8 utf-16

How can I use std::codecvt_utf8_utf16 to convert to and from utf8 without any string class?


How can I use std::codecvt_utf8_utf16 to convert from UTF-8 to UTF-16 and back without using any string class such as std::string or std::wstring, only plain arrays and string literals? And how can I know the right size for the buffer that will hold the converted text?

For example to meet this interface:

std::unique_ptr<char16_t[]> ToUTF16(const char* utf8String);
std::unique_ptr<char[]> ToUTF8(const char16_t* utf16String);

Solution

  • You can do this by using the codecvt_utf8_utf16 members directly. Your first step is to find the length of the input, with strlen or std::char_traits<char>::length (assuming it's NUL-terminated). The codecvt members work on pointer ranges, so you need to know where your input ends.

    However, a problem arises: the size of the output buffer. While codecvt does have a length member, it only covers conversions that use in, that is, from UTF-8 to UTF-16, and it reports how many input characters would be consumed for a given output limit rather than how many output characters are produced. There is no corresponding member for the other direction.
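
    To make that limitation concrete, here is a minimal sketch of what length reports for the in direction: the number of UTF-8 bytes that would be consumed, given a cap on the number of UTF-16 code units produced. Utf8BytesConsumed is a hypothetical helper name introduced only for this illustration.

```cpp
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>   // std::char_traits only, no string objects

// Hypothetical helper: how many UTF-8 bytes would in() consume while
// producing at most max_units UTF-16 code units?
int Utf8BytesConsumed(const char* utf8String, std::size_t max_units)
{
    assert(utf8String != nullptr);
    const char* end_ptr =
        utf8String + std::char_traits<char>::length(utf8String);

    std::codecvt_utf8_utf16<char16_t> converter;
    std::mbstate_t state{};  // zero-initialized conversion state

    // length() counts input (external) characters, not output ones.
    return converter.length(state, utf8String, end_ptr, max_units);
}
```

    The return value counts input bytes, so it does not directly give the size of the UTF-16 output buffer.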

    As such, the only way to handle this is to convert some of the data to a buffer of known size. If the conversion isn't fully finished, then convert some more of the data. After all that's done, put all of the pieces into a buffer now that you know how many characters are going to be there.

    While your question says that you don't want to use strings, I'm going to use vector<T> for that because if I didn't, I'd just be rewriting vector. And there's no reason to do that.

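    #include <algorithm>
    #include <array>
    #include <codecvt>
    #include <cstddef>
    #include <cwchar>
    #include <locale>
    #include <memory>
    #include <stdexcept>
    #include <string>
    #include <vector>

    std::unique_ptr<char16_t[]> ToUTF16(const char* utf8String)
    {
        // Chunk size is arbitrary, but must hold at least one surrogate pair.
        constexpr std::size_t buffer_size = 256;

        auto end_ptr = utf8String + std::char_traits<char>::length(utf8String);
        std::codecvt_utf8_utf16<char16_t> converter;
        // The conversion state must be zero-initialized before the first call.
        std::codecvt_utf8_utf16<char16_t>::state_type state{};

        std::array<char16_t, buffer_size> buffer;
        std::vector<char16_t> storage;

        auto curr_in_ptr = utf8String;

        while(curr_in_ptr != end_ptr)
        {
            auto chunk_start = curr_in_ptr;
            // in() takes raw pointers, so use data() rather than iterators.
            char16_t* out_loc = buffer.data();

            std::codecvt_base::result rslt = converter.in(state,
                chunk_start, end_ptr, curr_in_ptr,
                buffer.data(), buffer.data() + buffer.size(), out_loc);

            // Malformed or truncated input would otherwise loop forever.
            if(rslt == std::codecvt_base::error ||
                (curr_in_ptr == chunk_start && out_loc == buffer.data()))
                throw std::runtime_error("ToUTF16: invalid UTF-8 input");

            storage.insert(storage.end(), buffer.data(), out_loc);
        }

        //+1 for NUL terminator.
        std::unique_ptr<char16_t[]> ret(new char16_t[storage.size() + 1]);
        std::copy(storage.begin(), storage.end(), ret.get());
        ret[storage.size()] = char16_t();
        return ret;
    }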
    

    The reverse conversion (ToUTF8) works in the same way, except that in becomes out and the char16_t and char types are swapped.
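
    As a rough sketch of that reverse direction, the function below follows the same chunked approach but calls out. The 256-element chunk size and the choice to throw std::runtime_error on bad input are assumptions of this sketch, not something the interface in the question requires. Note that <codecvt> is deprecated since C++17, so newer compilers may warn.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <memory>
#include <stdexcept>
#include <string>     // std::char_traits only, no string objects
#include <vector>

// Sketch: UTF-16 in, UTF-8 out, converting one fixed-size chunk at a time.
std::unique_ptr<char[]> ToUTF8(const char16_t* utf16String)
{
    assert(utf16String != nullptr);
    // Assumed chunk size; it must hold at least one 4-byte UTF-8 sequence.
    constexpr std::size_t buffer_size = 256;

    const char16_t* curr_in_ptr = utf16String;
    const char16_t* end_ptr =
        utf16String + std::char_traits<char16_t>::length(utf16String);

    std::codecvt_utf8_utf16<char16_t> converter;
    std::mbstate_t state{};  // zero-initialized conversion state

    std::array<char, buffer_size> buffer;
    std::vector<char> storage;

    while (curr_in_ptr != end_ptr)
    {
        const char16_t* chunk_start = curr_in_ptr;
        char* out_loc = buffer.data();

        // out() converts from the internal type (char16_t) to the
        // external type (char) and advances both "next" pointers.
        std::codecvt_base::result rslt = converter.out(state,
            chunk_start, end_ptr, curr_in_ptr,
            buffer.data(), buffer.data() + buffer.size(), out_loc);

        // Stop on invalid input, or when a call makes no progress at all,
        // which would otherwise loop forever.
        if (rslt == std::codecvt_base::error ||
            (curr_in_ptr == chunk_start && out_loc == buffer.data()))
            throw std::runtime_error("ToUTF8: invalid UTF-16 input");

        storage.insert(storage.end(), buffer.data(), out_loc);
    }

    // +1 for the NUL terminator.
    std::unique_ptr<char[]> ret(new char[storage.size() + 1]);
    std::copy(storage.begin(), storage.end(), ret.get());
    ret[storage.size()] = '\0';
    return ret;
}
```

    A call such as ToUTF8(u"h\u00E9llo") returns a NUL-terminated buffer owned by the unique_ptr, so the caller never has to guess the output size up front.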