
Why does USB use UTF-16 for strings (why not UTF-8)?


UTF-16 requires 2 bytes and UTF-8 requires 1 byte,
and since USB is 8-bit oriented, UTF-8 is more natural.

UTF-8 is backward compatible with ASCII, UTF-16 isn't.

UTF-16 requires 2 bytes, so it can have an endianness problem.
(An endianness problem did occur; the USB-IF later clarified that it is little endian.)

UTF-16 and UTF-8 are functionally equivalent,

but why UTF-16? Why not UTF-8?


Comparison of UTF-16 and UTF-8: https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16


Solution

  • UTF-16 requires 2 bytes and UTF-8 requires 1 byte.

    This is wrong on both counts. Both UTF-8 and UTF-16 are variable-length encodings. You might be thinking of UCS-2 instead (UTF-16's predecessor), which did indeed use only 2 bytes (and as such was limited to codepoints up to U+FFFF only).

    UTF-8 uses 1 byte for codepoints U+0000 - U+007F, 2 bytes for codepoints U+0080 - U+07FF, 3 bytes for U+0800 - U+FFFF, and 4 bytes for codepoints U+10000 - U+10FFFF.

    UTF-16 uses 2 bytes for codepoints U+0000 - U+FFFF, and 4 bytes for codepoints U+10000 - U+10FFFF.
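
    To make those sizes concrete, here is a quick sketch in Python that prints how many bytes each encoding needs for a few sample codepoints:

        # Encoded size of sample codepoints in UTF-8 vs UTF-16
        # ("utf-16-le" is used so that no BOM gets counted).
        samples = {
            "U+0041 'A'":    "\u0041",      # ASCII letter
            "U+00E9 'é'":    "\u00e9",      # Latin-1 supplement
            "U+20AC '€'":    "\u20ac",      # BMP codepoint, 3 bytes in UTF-8
            "U+1F600 emoji": "\U0001F600",  # outside the BMP, surrogate pair in UTF-16
        }
        for label, ch in samples.items():
            print(f"{label}: UTF-8 = {len(ch.encode('utf-8'))} bytes, "
                  f"UTF-16 = {len(ch.encode('utf-16-le'))} bytes")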

    and since USB is 8-bit oriented, UTF-8 is more natural.

    Not really. If you take into account the byte sizes mentioned above, UTF-16 actually handles more codepoints with fewer code units than UTF-8 does. But in any case, USB cares more about binary data than human-readable text data. Even Unicode strings are prefixed with a byte count, not a character count. So the designers of USB could have used any encoding they wanted, as long as they standardized it. They chose UTF-16LE.
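
    As an illustration of that byte-count-prefixed, UTF-16LE layout, here is a minimal sketch (Python) of how a USB string descriptor is put together: bLength, bDescriptorType (0x03 for STRING), then the text in UTF-16LE with no BOM. It is only a sketch, not code taken from the specification:

        def string_descriptor(text):
            """Build a USB STRING descriptor: 2-byte header + UTF-16LE payload (sketch)."""
            payload = text.encode("utf-16-le")       # byte order fixed as little endian, no BOM
            length = 2 + len(payload)                # bLength counts the 2-byte header too
            if length > 0xFF:
                raise ValueError("string too long for a single descriptor")
            return bytes([length, 0x03]) + payload   # 0x03 = STRING descriptor type

        print(string_descriptor("USB").hex(" "))
        # 08 03 55 00 53 00 42 00  -> bLength=8, STRING type, 'U' 'S' 'B' in UTF-16LE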

    Why? Ask the designers. My guess (and it is just a guess) is that Microsoft co-authored the USB 1.0 specification, and UCS-2 (now UTF-16LE) was Microsoft's encoding of choice for Windows, so they probably wanted to maintain compatibility without involving a lot of runtime conversions. Back then, Windows had almost 90% of the PC market, whereas other OSes, particularly *Nix, had only around 5%. Windows 98 was the first Windows version to have USB baked directly into the OS (USB was an optional add-on in Windows 95), but USB was already becoming popular on PCs even before Apple eventually added USB support to the iMac.

    Besides, and probably more important, back then UTF-8 was still relatively new (it was only a few years old when USB 1.0 was authored), whereas UCS-2 had been around for a while and was the primary Unicode encoding at the time (Unicode would not exceed 65536 codepoints for a few more years). So it probably made sense at the time to have USB support international text by using UCS-2 (later UTF-16LE) instead of UTF-8. If they had decided on an 8-bit encoding instead, ISO-8859-1 probably would have made more sense than UTF-8 (though by today's standards, ISO-8859-1 doesn't cut it anymore). And by the time Unicode finally did break the 65536-codepoint limit of UCS-2, it was too late to change the encoding to something else without breaking backwards compatibility. At least UTF-16 is backwards compatible with UCS-2 (which is the same reason Windows still uses UTF-16 rather than switching to UTF-8, as some other OSes have done).

    UTF-8 is backward compatible with ASCII, UTF-16 isn't.

    True.

    UTF-16 requires 2 bytes, so it can have an endianness problem.

    True. Same with UTF-32, for that matter.
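
    For example, serializing the same BMP codepoint under both UTF-16 byte orders swaps its bytes, which is exactly why the byte order had to be pinned down (the USB-IF settled on little endian):

        euro = "\u20ac"                            # U+20AC, one 16-bit code unit
        print(euro.encode("utf-16-le").hex(" "))   # ac 20  (little endian)
        print(euro.encode("utf-16-be").hex(" "))   # 20 ac  (big endian)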