Today I was learning about character encoding and Unicode, but there is one thing I'm not sure about. I used this website to convert 字 to its Unicode value, 101101101010111 (which, from my understanding, comes from a character set), and the same character to UTF-16 (a character encoding system), which gave 01010111 01011011, which is how it is supposed to be saved in memory or on disk.

Am I right? If yes, how did the encoding system change 101101101010111 into 01010111 01011011? How does it work?
Unicode at its core is indeed a character set, i.e. it assigns numbers to what most people think of as characters. These numbers are called codepoints.

The codepoint for 字 is U+5B57. This is the format in which codepoints are usually specified; "5B57" is a hexadecimal number.
In binary, 5B57 is 101101101010111, or 0101101101010111 if extended to 16 bits. But it is very unusual to specify codepoints in binary.
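If you want to check this yourself, here is a minimal sketch (assuming a Python 3 interpreter) that prints the codepoint in the forms mentioned above:

```python
# Inspect the codepoint of 字 in hexadecimal and binary.
cp = ord("字")             # the Unicode codepoint as an integer
print(hex(cp))             # 0x5b57  -> usually written U+5B57
print(format(cp, "b"))     # 101101101010111  (15 bits)
print(format(cp, "016b"))  # 0101101101010111 (padded to 16 bits)
```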
UTF-16 is one of several encodings for Unicode, i.e. a representation in memory or in files. UTF-16 uses 16-bit code units. Since 16 bits are 2 bytes, two variants exist for splitting a code unit into bytes: most significant byte first (big endian) and least significant byte first (little endian). They are often called UTF-16BE and UTF-16LE. Since most computers today use a little-endian architecture, UTF-16LE is more common.
A single codepoint can result in 1 or 2 UTF-16 code units. In this particular case, it is a single code unit, and its value is the same as the codepoint: 5B57. It is saved as two bytes, either as:

- 5B 57 (or 01011011 01010111 in binary, big endian), or
- 57 5B (or 01010111 01011011 in binary, little endian).
The latter is the one you have shown, so it is the UTF-16LE encoding.
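You can see both byte orders directly with Python's standard codecs (the names "utf-16-be" and "utf-16-le" are the standard library's spellings):

```python
# Encode 字 with both UTF-16 byte orders and show the raw bytes.
text = "字"
print(text.encode("utf-16-be").hex())  # 5b57 (big endian)
print(text.encode("utf-16-le").hex())  # 575b (little endian)
# The plain "utf-16" codec prepends a byte order mark (BOM) and then uses
# the platform's native order, typically little endian: fffe575b.
print(text.encode("utf-16").hex())
```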
For codepoints resulting in 2 UTF-16 code units, the encoding is somewhat more involved. It is explained in the UTF-16 Wikipedia article.
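As an illustration of the two-code-unit case, here is my own sketch of the surrogate-pair calculation that article describes, using U+1F600 as an example codepoint (not part of the original question):

```python
# Surrogate-pair calculation for a codepoint above the BMP.
cp = 0x1F600
assert cp > 0xFFFF                 # only codepoints above U+FFFF need two code units
v = cp - 0x10000                   # 20-bit value
high = 0xD800 + (v >> 10)          # high (lead) surrogate: upper 10 bits
low = 0xDC00 + (v & 0x3FF)         # low (trail) surrogate: lower 10 bits
print(hex(high), hex(low))         # 0xd83d 0xde00

# Cross-check against Python's own encoder (big endian, no BOM):
print(chr(cp).encode("utf-16-be").hex())  # d83dde00
```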