Tags: c++, c++11, character-encoding, locale

Does C++ support converting between character encodings other than UTF-8, UTF-16, and UTF-32?


I understand that std::codecvt<char16_t, char, std::mbstate_t> in C++11 performs conversion between UTF-16 and UTF-8, and std::codecvt<char32_t, char, std::mbstate_t> performs conversion between UTF-32 and UTF-8. Is it possible to convert between, say, UTF-8 and ISO 8859-1?

Consider:

const char* s = "\u00C0";

If I print this string and my terminal's encoding is set to UTF-8, I will see the character À. If I set my terminal's encoding to ISO 8859-1, however, printing that string will not print out the desired character. How would I convert s into a string that, when printed, will show the character À if my terminal's encoding is set to ISO 8859-1?

I understand that this can be done with a library such as iconv, but I am curious whether it can be done using only the C++ standard library. I ask this question not because I don't want to use iconv, but because I don't really understand how locales work in C++.


Solution

  • In addition to the standard-mandated encodings, C++ also supports an implementation-defined set of encodings through locales:

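    #include <codecvt>  // std::wstring_convert (deprecated since C++17)
    #include <cwchar>   // std::mbstate_t
    #include <locale>
    #include <string>
    
    // codecvt_byname has a protected destructor; this wrapper makes it public
    // so that std::wstring_convert can own and delete the facet.
    template <typename Facet>
    struct usable_facet : Facet {
      using Facet::Facet;
    };
    
    using codecvt = usable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
    
    int main() {
      // Locale names are platform specific; ".1252" is a Windows-style name for code page 1252.
      std::wstring_convert<codecvt> convert(new codecvt(".1252"));
    
      // The input must already be in the locale's encoding: 0xC0 is À in Windows-1252.
      std::wstring w = convert.from_bytes("\xC0");
    }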
    

    Unfortunately, the standard requires only that wchar_t use a fixed-width encoding in each locale; it does not require that the encoding be the same across locales. As a result, you can't portably convert to wchar_t using one locale and then convert that back to char using a different locale.
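
    To make the limitation concrete, here is a minimal sketch of that two-locale round trip. The helper name transcode_via_wchar is hypothetical, and the code only gives meaningful results on implementations where wchar_t has the same meaning in both locales; it calls the std::codecvt<wchar_t, char, std::mbstate_t> facet of each locale directly and reports failure when a locale name is not available.

    ```cpp
    #include <cstddef>
    #include <cwchar>
    #include <locale>
    #include <stdexcept>
    #include <string>

    using wide_codecvt = std::codecvt<wchar_t, char, std::mbstate_t>;

    // Hypothetical helper: decode `bytes` using the narrow encoding of locale
    // `from`, then re-encode the wide string using the encoding of locale `to`.
    // Returns false if a locale name is unknown or a conversion step fails.
    bool transcode_via_wchar(const std::string& bytes, const char* from,
                             const char* to, std::string& out) {
      try {
        const std::locale src(from);  // throws std::runtime_error if unknown
        const std::locale dst(to);
        const wide_codecvt& dec = std::use_facet<wide_codecvt>(src);
        const wide_codecvt& enc = std::use_facet<wide_codecvt>(dst);

        // char -> wchar_t: at most one wide character per input byte.
        std::wstring wide(bytes.size(), L'\0');
        std::mbstate_t in_state{};
        const char* from_next = nullptr;
        wchar_t* wide_next = nullptr;
        if (dec.in(in_state, bytes.data(), bytes.data() + bytes.size(), from_next,
                   &wide[0], &wide[0] + wide.size(),
                   wide_next) != std::codecvt_base::ok)
          return false;
        wide.resize(static_cast<std::size_t>(wide_next - &wide[0]));

        // wchar_t -> char: max_length() bounds the bytes needed per character.
        std::string narrow(
            wide.size() * static_cast<std::size_t>(enc.max_length()), '\0');
        std::mbstate_t out_state{};
        const wchar_t* w_next = nullptr;
        char* narrow_next = nullptr;
        if (enc.out(out_state, wide.data(), wide.data() + wide.size(), w_next,
                    &narrow[0], &narrow[0] + narrow.size(),
                    narrow_next) != std::codecvt_base::ok)
          return false;
        narrow.resize(static_cast<std::size_t>(narrow_next - &narrow[0]));
        out = narrow;
        return true;
      } catch (const std::runtime_error&) {
        return false;  // one of the locale names is not available here
      }
    }
    ```

    On a system where wchar_t holds Unicode code points in every locale (glibc signals this by defining __STDC_ISO_10646__), a call such as transcode_via_wchar("\xC3\x80", "en_US.UTF-8", "en_US.ISO-8859-1", out) would be expected to work if both locales are installed. Those locale names are examples only, and nothing in the standard guarantees the result elsewhere.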

    There is potentially some portable support for such conversions through std::mbrtoc32 and the related functions in <cuchar>, which convert between the current locale's multibyte encoding and char32_t, but these are not yet widely implemented.
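
    As a rough sketch of that route, assuming the implementation provides <cuchar> and that char32_t values produced by std::mbrtoc32 are UTF-32 (signalled by __STDC_UTF_32__), the following decodes bytes in the current C locale's encoding to code points. The second function is an addition for illustration: it relies on the fact that ISO 8859-1 byte values coincide with U+0000 to U+00FF, so no locale is needed for that step. Both function names are hypothetical.

    ```cpp
    #include <cstddef>
    #include <cuchar>
    #include <cwchar>
    #include <string>

    // Decode bytes in the current C locale's multibyte encoding to char32_t.
    // Call std::setlocale first to select the source encoding.
    bool locale_bytes_to_utf32(const std::string& bytes, std::u32string& out) {
      std::mbstate_t state{};
      const char* p = bytes.data();
      const char* end = p + bytes.size();
      out.clear();
      while (p < end) {
        char32_t c = 0;
        const std::size_t n =
            std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
          return false;  // invalid or truncated multibyte sequence
        out.push_back(c);
        if (n == static_cast<std::size_t>(-3)) continue;  // no input consumed
        p += (n == 0 ? 1 : n);  // 0 means a null character; assume one byte
      }
      return true;
    }

    // Encode code points as ISO 8859-1; fails for anything above U+00FF.
    bool utf32_to_latin1(const std::u32string& in, std::string& out) {
      out.clear();
      for (char32_t c : in) {
        if (c > 0xFF) return false;  // not representable in ISO 8859-1
        out.push_back(static_cast<char>(static_cast<unsigned char>(c)));
      }
      return true;
    }
    ```

    With a UTF-8 locale selected through std::setlocale, chaining the two functions would turn the UTF-8 bytes for À into the single byte 0xC0. Whether a UTF-8 locale is available, and under what name, is still platform specific.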

    From the question: "I understand that this can be done with a library such as iconv, but I am curious whether it can be done using only the C++ standard library. I ask this question not because I don't want to use iconv, but because I don't really understand how locales work in C++."

    The locale library's design doesn't really lend itself to modern usage. C and C++ are themselves confused about encodings vs. character sets, and locales conflate lexical and orthographic issues with computational aspects such as encoding.

    How locales work is a broader topic than suits a Stack Overflow answer, but there are books on it. You'd probably also need to read platform-specific materials, because the standard doesn't give much context for a lot of the functionality. For example, the locale library supports message catalogues, but doesn't tell you what they are or how you'd actually make one, because that functionality is not standardized by C++.