c++ c unicode iconv

Issue with iconv writing a BOM if the output is platform-endian


When I choose UTF-32 (which uses the platform's endianness), libiconv converts correctly but prefixes a 0xFEFF BOM to the output stream. This causes some trouble.

When I choose UCS-4, no BOM is written, but on my system it converts to big endian, which is not the endianness of my system.

Are there any suggestions for how to convert to UTF-32/UCS-4 with the platform's endianness without having to remove the BOM manually?


Solution

  • iconv (both the glibc implementation and the GNU libiconv implementation) supports encoding names that specify a fixed endianness:

    • UTF-32LE = UCS-4LE : UCS-4 in little endian flavour, without BOM
    • UTF-32BE = UCS-4BE : UCS-4 in big endian flavour, without BOM
    • UTF-16LE : UTF-16 in little endian flavour, without BOM
    • UTF-16BE : UTF-16 in big endian flavour, without BOM
    • wchar_t (an alias for UCS-4-INTERNAL) : UCS-4 with the platform's endianness and alignment restrictions

    Note that strings in these encodings should not be transported to other machines: without a BOM, the receiver has no way to detect the byte order from the data itself.
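
As a rough illustration of the fixed-endianness names above, here is a minimal C sketch that picks "UTF-32LE" or "UTF-32BE" at run time to match the machine's byte order, so the output is platform-endian and carries no BOM. The helper names native_utf32_name and utf8_to_native_utf32 are made up for this example and are not part of iconv. It assumes the input is complete, valid UTF-8 and that the output buffer is large enough. Any conversion error is simply reported as (size_t)-1.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: name of the BOM-less UTF-32 variant that matches
   this machine's byte order. */
static const char *native_utf32_name(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);          /* lowest-addressed byte of the value 1 */
    return first == 1 ? "UTF-32LE" : "UTF-32BE";
}

/* Hypothetical helper: converts in_len bytes of UTF-8 into native-endian
   UTF-32 code units. Returns the number of code units written, or
   (size_t)-1 on any error (unsupported encoding, invalid or truncated
   input, output buffer too small). */
static size_t utf8_to_native_utf32(const char *in, size_t in_len,
                                   uint32_t *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open(native_utf32_name(), "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;             /* iconv() takes char **, input is not modified */
    char *dst = (char *)out;
    size_t src_left = in_len;
    size_t dst_left = out_cap * sizeof(uint32_t);

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Code units written = capacity minus what is still free. */
    return out_cap - dst_left / sizeof(uint32_t);
}
```

Called with the 4-byte UTF-8 string "A€" and an 8-element buffer, this should yield two code units, 0x41 and 0x20AC, with no leading 0xFEFF. With GNU libiconv as a separate library (rather than glibc's built-in iconv), linking with -liconv is typically needed.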