c++ c++11 wstring

Why mask a char with 0xFF when converting narrow string to wide string?


Consider this function to convert narrow strings to wide strings:

#include <codecvt>
#include <cstddef>
#include <locale>
#include <stdexcept>
#include <string>

std::wstring convert(const std::string& input)
{
    try
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
        return converter.from_bytes(input);
    }
    catch(std::range_error& e)
    {
        std::size_t length = input.length();
        std::wstring result;
        result.reserve(length);
        for(std::size_t i = 0; i < length; i++)
        {
            result.push_back(input[i] & 0xFF);
        }
        return result;
    }
}

I am having difficulty understanding the need for this expression in the fallback path:

result.push_back(input[i] & 0xFF);

Why is each character in the string being masked with 0xFF (0b11111111)?


Solution

  • Masking with 0xFF maps any negative char values into the range 0-255.

    This is reasonable if, for example, your platform's char is an 8-bit signed type holding ISO-8859-1 (Latin-1) characters and your wchar_t holds UCS-2, UTF-16 or UCS-4, because the 256 Latin-1 code points have the same numeric values as the first 256 Unicode code points.


    Without this correction (or something similar, such as casting to unsigned char or std::byte), you would find that characters with the high bit set (0x80-0xFF) are sign-extended when promoted to the wider type.

    Example: 0xa9 (© in Unicode and Latin-1, -87 in signed 8-bit) would become \uffa9 instead of \u00a9 when widened to a 16-bit wchar_t.
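
    To make that arithmetic concrete, here is a minimal sketch. The helper names (widen_naive, widen_masked, widen_cast, check_widening) are illustrative and not part of the original code. It uses signed char and std::uint16_t explicitly, so the result does not depend on whether the platform's char is signed or how wide wchar_t is, and it assumes the usual two's-complement representation.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Hypothetical helpers: three ways to widen one signed 8-bit character
    // to a 16-bit code unit.

    // Plain conversion: -87 is converted modulo 2^16, giving 0xFFA9.
    constexpr std::uint16_t widen_naive(signed char c)
    {
        return static_cast<std::uint16_t>(c);
    }

    // Mask: c is promoted to int (sign-extended), then & 0xFF keeps only
    // the low 8 bits, giving 0x00A9.
    constexpr std::uint16_t widen_masked(signed char c)
    {
        return static_cast<std::uint16_t>(c & 0xFF);
    }

    // Cast: converting to unsigned char first yields 169 (0xA9), which then
    // widens without any sign extension.
    constexpr std::uint16_t widen_cast(signed char c)
    {
        return static_cast<std::uint16_t>(static_cast<unsigned char>(c));
    }

    // Runs the example from the text: byte 0xA9 stored in a signed char is -87.
    inline void check_widening()
    {
        const signed char copyright_byte = -87;
        assert(widen_naive(copyright_byte) == 0xFFA9);   // sign-extended
        assert(widen_masked(copyright_byte) == 0x00A9);  // corrected by the mask
        assert(widen_cast(copyright_byte) == 0x00A9);    // corrected by the cast
    }
    ```

    The mask and the cast give the same result here; the difference is only in how clearly each states the intent.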


    I think it's clearer to convert the char to an unsigned char - that works for any size char, and conveys the intent better. You can change that expression directly, or create a codecvt subclass that gives a name to what you're doing.
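
    The first option, changing the expression directly, might look like the sketch below. widen_latin1 and check_widen_latin1 are hypothetical names used for illustration; the function is meant as a stand-in for the loop in the question's catch block, assuming the input bytes are Latin-1.

    ```cpp
    #include <cassert>
    #include <string>

    // Hypothetical helper: treat every byte of input as one Latin-1 code point.
    std::wstring widen_latin1(const std::string& input)
    {
        std::wstring result;
        result.reserve(input.size());
        for (char c : input) {
            // Go through unsigned char so bytes >= 0x80 are not sign-extended.
            result.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
        }
        return result;
    }

    // Small self-check: the bytes C2 A3 A9 must come out as U+00C2 U+00A3 U+00A9.
    inline void check_widen_latin1()
    {
        const std::wstring wide = widen_latin1("\xc2\xa3\xa9");
        assert(wide.size() == 3);
        assert(wide[0] == 0xC2 && wide[1] == 0xA3 && wide[2] == 0xA9);
    }
    ```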

    Here's how to write and use a minimal codecvt (for narrow → wide conversion only):

    #include <codecvt>
    #include <locale>
    #include <string>
    
    class codecvt_latin1 : public std::codecvt<wchar_t,char,std::mbstate_t>
    {
    protected:
        virtual result do_in(std::mbstate_t&,
                             const char* from,
                             const char* from_end,
                             const char*& from_next,
                             wchar_t* to,
                             wchar_t* to_end,
                             wchar_t*& to_next) const override
        {
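            // Each Latin-1 byte maps to the code point with the same value;
            // go through unsigned char so bytes >= 0x80 are not sign-extended.
            while (from != from_end && to != to_end)
                *to++ = static_cast<unsigned char>(*from++);
            from_next = from;
            to_next = to;
            // Report partial if the output buffer filled before the input ran out.
            return from == from_end ? result::ok : result::partial;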
        }
    };
    
    std::wstring convert(const std::string& input)
    {
        using codecvt_utf8 = std::codecvt_utf8<wchar_t>;
        try {
            return std::wstring_convert<codecvt_utf8>().from_bytes(input);
        } catch (std::range_error&) {
            return std::wstring_convert<codecvt_latin1>{}.from_bytes(input);
        }
    }
    
    #include <iostream>
    
    int main()
    {
        std::locale::global(std::locale{""});
    
        // UTF-8:  £© おはよう
        std::wcout << convert(u8"\xc2\xa3\xc2\xa9 おはよう") << std::endl;
        // Latin-1: Â£©
        std::wcout << convert("\xc2\xa3\xa9") << std::endl;
    }
    

    Output:

    £© おはよう
    Â£©