Search code examples
unicodecharacter-encodingutf-16file-typebyte-order-mark

Unicode BOM for UTF-16LE vs UTF32-LE


It seems like there's an ambiguity between the Byte Order Marks used for UTF16-LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:

FF FE 00 00 00 00 00 00

How can I tell if this file contains:

  1. The UTF16-LE BOM (FF FE) followed by 3 null characters; or
  2. The UTF32-LE BOM (FF FE 00 00) followed by one null character?

Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?


Solution

  • As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first, then you can use the BOM to determine whether the least or most significant bytes are first for multibyte sequences.

    A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.