Tags: c++, unicode, bstr

How to convert between BSTR and 32-bit Unicode strings in Visual C++?


I have third-party code which punycodes strings (escapes and unescapes). As Unicode input/output, it uses 32-bit Unicode strings (uint32_t-based), not 16-bit. My own input/output is BSTR (UTF-16). How should I convert between a 32-bit Unicode character array and a BSTR (in both directions)?

The code should work in Visual C++ 6.0 and later versions.


Solution

  • UTF-16 is the same as UTF-32 for code points below 0x10000, so only code points above that need special handling. You can use the following conversion to display UTF-32 code points in Windows.

    Note: this is based on the Wikipedia UTF-16 article. I didn't add any error checks; it expects valid code points.

    void get_utf16(std::wstring &str, int ch32)
    {
        const int mask = (1 << 10) - 1;
        if(ch32 < 0x10000)
        {
            //BMP code point: a single UTF-16 code unit
            str.push_back((wchar_t)ch32);
        }
        else
        {
            //supplementary code point: encode as a surrogate pair
            ch32 -= 0x10000;
            int hi = (ch32 >> 10) & mask;
            int lo = ch32 & mask;
    
            hi += 0xD800;   //high (lead) surrogate
            lo += 0xDC00;   //low (trail) surrogate
    
            str.push_back((wchar_t)hi);
            str.push_back((wchar_t)lo);
        }
    }
    

    For example, the following code should display a smiley face on Windows 10:

    std::wstring str;
    get_utf16(str, 0x1f600);
    ::MessageBoxW(0, str.c_str(), 0, 0);
    


    Edit:

    Obtaining UTF-16 from an array of UTF-32 code points, and the reverse operation:

    In a UTF-16 string, a code point occupies either one wchar_t (2 bytes) or two wchar_t values joined together as a surrogate pair (4 bytes). A code unit in the range 0xD800 to 0xDFFF indicates a surrogate pair.

    bool get_str_utf16(std::wstring &dst, const std::vector<unsigned int> &src)
    {
        const int mask = (1 << 10) - 1;
        for(size_t i = 0; i < src.size(); i++)
        {
            unsigned int ch32 = src[i];
            ////check for invalid range
            //if(ch32 > 0x10FFFF || (ch32 >= 0xD800 && ch32 < 0xE000))
            //{
            //  cout << "invalid code point\n";
            //  return false;
            //}
    
            if(ch32 >= 0x10000)
            {
                //supplementary code point: encode as a surrogate pair
                ch32 -= 0x10000;
                int hi = (ch32 >> 10) & mask;
                int lo = ch32 & mask;
                hi += 0xD800;
                lo += 0xDC00;
                dst.push_back((wchar_t)hi);
                dst.push_back((wchar_t)lo);
            }
            else
            {
                //BMP code point: a single UTF-16 code unit
                dst.push_back((wchar_t)ch32);
            }
        }
        return true;
    }
    
    void get_str_utf32(std::vector<unsigned int> &dst, const std::wstring &src)
    {
        for(unsigned i = 0; i < src.size(); i++)
        {
            const wchar_t ch = src[i];
            if(ch >= 0xD800 && ch < 0xE000)
            {
                //surrogate: this character is joined with the next character
                if(i < src.size() - 1)
                {
                    unsigned int hi = src[i]; i++;
                    unsigned int lo = src[i];
                    hi -= 0xD800;
                    lo -= 0xDC00;
                    //recombine the surrogate pair into one code point
                    unsigned int u32 = 0x10000 + (hi << 10) + lo;
                    dst.push_back(u32);
                }
            }
            else
            {
                //BMP character: the code unit is the code point
                dst.push_back(ch);
            }
        }
    }
    

    Example:

    std::wstring u16 = L"123🙂456";
    
    std::vector<unsigned int> u32;
    get_str_utf32(u32, u16);
    
    std::cout << "UTF-32 result: ";
    for(auto e : u32)
        printf("0x%X ", e);
    std::cout << "\n";
    
    std::wstring test;
    get_str_utf16(test, u32);
    ::MessageBoxW(0, test.c_str(), (u16 == test) ? L"OK" : L"ERROR", 0);
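
    Since the question asks for BSTR rather than std::wstring, the results above can be bridged to BSTR with SysAllocStringLen/SysStringLen (a BSTR is just a length-prefixed UTF-16 buffer). Here is a minimal sketch built on the helpers defined above; the wrapper names utf32_to_bstr/bstr_to_utf32 are just for illustration:

    #include <windows.h>   //SysAllocStringLen, SysStringLen, SysFreeString
    
    //illustration only: wraps the get_str_utf16/get_str_utf32 helpers above
    
    //UTF-32 code points -> BSTR; the caller frees the result with SysFreeString
    BSTR utf32_to_bstr(const std::vector<unsigned int> &src)
    {
        std::wstring u16;
        get_str_utf16(u16, src);
        return ::SysAllocStringLen(u16.c_str(), (UINT)u16.size());
    }
    
    //BSTR -> UTF-32 code points; a NULL BSTR is treated as an empty string
    void bstr_to_utf32(std::vector<unsigned int> &dst, BSTR bstr)
    {
        std::wstring u16(bstr ? bstr : L"", ::SysStringLen(bstr));
        get_str_utf32(dst, u16);
    }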