c++ unicode utf-8

How to convert text from CP437 encoding to UTF-8 encoding?


On Windows, the character ö (Latin small letter o with diaeresis) has the value 148 in the CP437 character set.

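On Linux, ö is encoded in UTF-8 as two bytes (shown here as signed char values):

-61 (first byte, 0xC3 = 195 unsigned)
-74 (second byte, 0xB6 = 182 unsigned)
(read as a little-endian 16-bit unsigned value = 46787)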

My question is: how can I convert the value 148 from CP437 to UTF-8 in C++ on Linux?

The details of my problem are described here:

open() function in Linux with extended characters (128-255) returns -1 error

Temporary solution: C++11 supports the conversion to UTF-8 using codecvt_utf8


Solution

  • On Windows, you can use the Win32 MultiByteToWideChar() function to convert data from CP437 to UTF-16, and then use the WideCharToMultiByte() function to convert data from UTF-16 to UTF-8.

    On Linux, you can use a Unicode conversion library, like libiconv or ICU (which are available for Windows, too).
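As a rough illustration of the library route, here is a minimal sketch using the POSIX iconv API. The function name cp437ToUtf8Iconv is made up for this example, and it assumes the iconv implementation on your system recognizes the encoding name "CP437" (glibc does when its gconv modules are installed; other implementations may want a different alias such as "IBM437").

```cpp
#include <iconv.h>

#include <stdexcept>
#include <string>

// Hypothetical helper: converts a CP437 byte string to UTF-8 via iconv.
// Assumes the local iconv knows the name "CP437".
std::string cp437ToUtf8Iconv(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "CP437");  // (to, from)
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed: CP437 not supported?");
    }

    // Every CP437 character maps to a code point below U+10000,
    // so 3 output bytes per input byte is always enough.
    std::string out(in.size() * 3, '\0');

    char* src = const_cast<char*>(in.data());
    size_t srcLeft = in.size();
    char* dst = &out[0];
    size_t dstLeft = out.size();

    size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        throw std::runtime_error("iconv conversion failed");
    }

    out.resize(out.size() - dstLeft);  // trim the unused tail
    return out;
}
```

Calling cp437ToUtf8Iconv("\x94") should give the two bytes 0xC3 0xB6, which is the UTF-8 form of ö from the question. On glibc no extra link flag is needed; with a standalone libiconv you would typically link with -liconv.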


    In C++11 and later, you can use std::wstring_convert to:

    • convert from CP437 to either UTF-16 or UTF-32/UCS-4 (if you can get/make a codecvt for CP437, that is).

    • then, convert from UTF-16 or UTF-32/UCS-4 to UTF-8.

    You can't use codecvt_utf8 to convert from CP437 to UTF-8 directly. It only supports conversions between:

    • UTF-8 and UCS-2 (not UTF-16!)

    • UTF-8 and UTF-32/UCS-4.

    You have to use codecvt_utf8_utf16 for conversions between UTF-8 and UTF-16.
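To make the distinction between the two facets concrete, here is a small sketch of the second step only (UTF-32 or UTF-16 to UTF-8). The helper names utf32ToUtf8 and utf16ToUtf8 are invented for this example. It does not cover the CP437 step, since the standard library ships no CP437 codecvt. Note that std::wstring_convert and the <codecvt> facets are deprecated since C++17, so treat this as a C++11/14-era approach.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-32 -> UTF-8 using codecvt_utf8<char32_t>.
std::string utf32ToUtf8(const std::u32string& in) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(in);
}

// Hypothetical helper: UTF-16 -> UTF-8. codecvt_utf8_utf16 is the facet
// that understands surrogate pairs; codecvt_utf8<char16_t> is UCS-2 only.
std::string utf16ToUtf8(const std::u16string& in) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(in);
}
```

For instance, utf32ToUtf8(U"\u00F6") and utf16ToUtf8(u"\u00F6") should both yield the bytes 0xC3 0xB6.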

    Alternatively, you can use mbrtoc16() to convert CP437 to UTF-16 under a CP437 locale, and then use c16rtomb() to convert UTF-16 to UTF-8 under a UTF-8 locale. This requires that your standard library implements the fix for DR488; otherwise c16rtomb() only supports UCS-2, not UTF-16.


    Otherwise, just create your own CP437-to-UTF-8 lookup table for the 256 possible CP437 bytes, and then do the conversion manually, one byte at a time.
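The lookup-table approach could look roughly like the sketch below. The names kCp437High, appendUtf8, and cp437ToUtf8 are invented for this example. The table follows the commonly published CP437-to-Unicode mapping; it is worth checking against an authoritative mapping file before relying on it. It also assumes bytes 0x00-0x7F should pass through as plain ASCII, which ignores the graphical glyphs CP437 displays for some control codes.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for CP437 bytes 0x80..0xFF (index = byte - 0x80).
// Transcribed from the commonly published mapping; verify before use.
static const std::uint16_t kCp437High[128] = {
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,  // 0x80
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,  // 0x88
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,  // 0x90
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,  // 0x98
    0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,  // 0xA0
    0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,  // 0xA8
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,  // 0xB0
    0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,  // 0xB8
    0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,  // 0xC0
    0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,  // 0xC8
    0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,  // 0xD0
    0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,  // 0xD8
    0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,  // 0xE0
    0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,  // 0xE8
    0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,  // 0xF0
    0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0   // 0xF8
};

// Appends the UTF-8 encoding of a code point up to U+FFFF to `out`.
// That range is sufficient here because every table entry is below U+10000.
static void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a CP437 byte string to UTF-8, one byte at a time.
std::string cp437ToUtf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) {
        appendUtf8(out, b < 0x80 ? b : kCp437High[b - 0x80]);
    }
    return out;
}
```

With this table, byte 148 (0x94) maps to U+00F6, so cp437ToUtf8("\x94") produces the bytes 0xC3 0xB6 from the question. The advantage over the library routes is that it has no dependency on installed locales or conversion modules.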