
In Qt, how do QTextCodec::codecForName("UTF-16") and codecForName("UTF-32") decide which endianness to use?


The Qt documentation states that the following Unicode string encodings (among others) are supported:

  • UTF-8
  • UTF-16
  • UTF-16BE
  • UTF-16LE
  • UTF-32
  • UTF-32BE
  • UTF-32LE

Since three different codecs are listed for each of the 2-octet and 4-octet Unicode encodings, I was wondering: how do the two codecs that do not name an endianness ("UTF-16" and "UTF-32") decide which byte order to use?


Solution

  • Based on the source code in src/corelib/codecs/, it seems Qt uses the byte ordering of the host for UTF-16 and UTF-32.

    If you use QTextCodec to read an existing Unicode string that has a BOM, and you didn't explicitly ask to ignore the header, the byte ordering detected in the string is used.

    • In qutfcodec_p.h, both QUtf16Codec::e and QUtf32Codec::e are initialized with the value DetectEndianness (an enum).

    • In qutfcodec.cpp, near the beginning of the functions convertFromUnicode and convertToUnicode of the classes QUtf16 and QUtf32 (used by QUtf16Codec and QUtf32Codec), you can find the line:

      endian = (QSysInfo::ByteOrder == QSysInfo::BigEndian) 
          ? BigEndianness : LittleEndianness;
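
To make the behavior described above concrete, here is a minimal standalone sketch in plain C++ (no Qt). It is not Qt's implementation: the names Endianness, hostEndianness, encodeUtf16 and decodeUtf16, as well as the writeBom parameter, are hypothetical and exist only for illustration. It assumes the two rules from the answer: when no byte order is requested, the host's byte order is used, and when decoding, a leading BOM (if present and not ignored) overrides that default.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the DetectEndianness / BigEndianness /
// LittleEndianness enum mentioned in the answer.
enum class Endianness { Detect, Big, Little };

// Byte order of the machine running this code
// (plays the role of QSysInfo::ByteOrder in this sketch).
inline Endianness hostEndianness() {
    const std::uint16_t probe = 0x0102;
    const unsigned char first = *reinterpret_cast<const unsigned char*>(&probe);
    return first == 0x01 ? Endianness::Big : Endianness::Little;
}

// Encode UTF-16 code units as bytes. With Endianness::Detect the host
// byte order is used, mirroring the line quoted from qutfcodec.cpp.
inline std::vector<unsigned char> encodeUtf16(const std::u16string& text,
                                              Endianness e = Endianness::Detect,
                                              bool writeBom = true) {
    if (e == Endianness::Detect)
        e = hostEndianness();

    std::vector<unsigned char> out;
    auto put = [&out, e](char16_t unit) {
        const unsigned char hi = static_cast<unsigned char>((unit >> 8) & 0xFF);
        const unsigned char lo = static_cast<unsigned char>(unit & 0xFF);
        if (e == Endianness::Big) { out.push_back(hi); out.push_back(lo); }
        else                      { out.push_back(lo); out.push_back(hi); }
    };

    if (writeBom)
        put(0xFEFF);  // byte order mark, written in the chosen byte order
    for (char16_t unit : text)
        put(unit);
    return out;
}

// Decode bytes into UTF-16 code units. With Endianness::Detect a leading
// BOM decides the byte order; without a BOM the host byte order is assumed.
inline std::u16string decodeUtf16(const std::vector<unsigned char>& bytes,
                                  Endianness e = Endianness::Detect) {
    assert(bytes.size() % 2 == 0);  // input must consist of whole 16-bit units

    std::size_t pos = 0;
    if (e == Endianness::Detect) {
        if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
            e = Endianness::Big;
            pos = 2;  // skip the BOM
        } else if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
            e = Endianness::Little;
            pos = 2;  // skip the BOM
        } else {
            e = hostEndianness();
        }
    }

    std::u16string out;
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned first = bytes[pos];
        const unsigned second = bytes[pos + 1];
        const unsigned unit = (e == Endianness::Big) ? ((first << 8) | second)
                                                     : ((second << 8) | first);
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}
```

In this sketch, calling encodeUtf16(u"Qt") with no explicit byte order produces the same bytes as passing hostEndianness() explicitly, while decodeUtf16 recovers the original text from either byte order as long as the BOM is present. The UTF-32 case would follow the same pattern with 4-byte units.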