unicode, encoding, character-encoding, utf-16

How does UTF-16 encoding work?


Today I was learning about character encoding and Unicode, but there is one thing I'm not sure about. I used this website to convert a symbol to Unicode, which gave 101101101010111 (Unicode being, from my understanding, a character set), and the same symbol to UTF-16 (a character encoding system), which gave 01010111 01011011. The latter is how it is supposed to be saved in memory or on disk.

  • Unicode is just a character set.
  • UTF-16 is an encoding system that transforms the character set's values so they can be saved in memory or on disk.

Am I right? If so, how did the encoding system change 101101101010111 into 01010111 01011011? How does it work?


Solution

  • Unicode is, at its core, indeed a character set, i.e. it assigns numbers to what most people think of as characters. These numbers are called codepoints.

    The codepoint for 字 is U+5B57. This is the format in which codepoints are usually specified; "5B57" is a hexadecimal number.

    In binary, 5B57 is 101101101010111, or 0101101101010111 if it is extended to 16 bits. But it is very unusual to specify codepoints in binary.
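The relationship between the character, its codepoint, and the binary digits from the question can be checked with a short Python sketch. It only uses built-ins (`ord` and format specifiers); the variable names are illustrative.

```python
# Look up the codepoint Unicode assigns to the character.
ch = "字"
code_point = ord(ch)

# The usual notation: "U+" followed by at least four hex digits.
hex_form = f"U+{code_point:04X}"

# The same number in binary, without and with padding to 16 bits.
binary = f"{code_point:b}"
binary_16 = f"{code_point:016b}"

print(hex_form)   # U+5B57
print(binary)     # 101101101010111
print(binary_16)  # 0101101101010111
```

The 15-digit form is simply the 16-bit form with its leading zero dropped, which is why the value in the question looks one bit short.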

    UTF-16 is one of several encodings for Unicode, i.e. a representation in memory or in files. UTF-16 uses 16-bit code units. Since 16 bits are 2 bytes, two variants exist for splitting a code unit into bytes:

    • little-endian (lower 8 bits first)
    • big-endian (higher 8 bits first)

    They are often called UTF-16LE and UTF-16BE. Since most computers today use a little-endian architecture, UTF-16LE is more common.

    A single codepoint can result in 1 or 2 UTF-16 code units. In this particular case, it's a single code unit, and it is the same as the value for the codepoint: 5B57. It is saved as two bytes, either as:

    5B 57 (or 01011011 01010111 in binary, big endian)

    57 5B (or 01010111 01011011 in binary, little endian)

    The latter is the one you have shown, so it is UTF-16LE encoding.
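To see both byte orders side by side, here is a minimal Python sketch using the standard `utf-16-be` and `utf-16-le` codecs. The helper `to_bits` is a hypothetical name introduced only for this illustration.

```python
ch = "字"

# Encode the same character with each byte order.
big_endian = ch.encode("utf-16-be")
little_endian = ch.encode("utf-16-le")

def to_bits(data: bytes) -> str:
    """Render each byte as 8 binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)

print(big_endian.hex(" "), "->", to_bits(big_endian))
# 5b 57 -> 01011011 01010111
print(little_endian.hex(" "), "->", to_bits(little_endian))
# 57 5b -> 01010111 01011011
```

The little-endian output matches the bits shown in the question: the code unit 5B57 is unchanged, only the order of its two bytes is swapped.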

    For codepoints resulting in 2 UTF-16 code units, the encoding is somewhat more involved. It is explained in the UTF-16 Wikipedia article.
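As a rough sketch of that two-code-unit case (commonly called a surrogate pair), the following Python function follows the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function name `utf16_code_units` and the example codepoint U+1F600 are chosen here for illustration and do not come from the answer above; the sketch also does not reject invalid input such as codepoints in the surrogate range D800 to DFFF.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for one codepoint (illustrative sketch)."""
    if code_point < 0x10000:
        # Fits in one 16-bit code unit: the value is used as-is.
        return [code_point]
    # Otherwise, split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return [high, low]

print([f"{u:04X}" for u in utf16_code_units(0x5B57)])   # ['5B57']
print([f"{u:04X}" for u in utf16_code_units(0x1F600)])  # ['D83D', 'DE00']
```

Each resulting code unit is then written as two bytes in big-endian or little-endian order, exactly as in the single-unit case.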