unicode encoding utf-8 character-encoding

Can anyone show me the calculation for how UTF-8 represents 1112064 characters?


I am not understanding how UTF-8 represents 1112064 characters.

My calculation is something like this: 2^7 + 2^11 + 2^16 + 2^21 = 2164864 characters.

In UTF-8, a 1-byte sequence has 7 bits available for the character, a 2-byte sequence has 11 bits, a 3-byte sequence has 16 bits, and a 4-byte sequence has 21 bits.

Does the number 1112064 exclude emojis?


Solution

  • 1112064 is the number of valid Unicode code points. It consists of 17 planes of 65536 code points each, U+NN0000..U+NNFFFF, where NN runs from 0x00 (the BMP, or Basic Multilingual Plane) through 0x10, less the 2048 code points U+D800..U+DFFF that are reserved for surrogates in the UTF-16 encoding.

    17 x 65536 - 2048 = 1112064

    The UTF-8 bit layout can represent more than that, but the specification restricts valid UTF-8 to valid Unicode code points only, each encoded in its shortest representation. For example, U+0000 could be written as the 1-byte 0x00 or as the 2-byte 0xC0 0x80, but the latter is invalid, as are the 3-byte and longer versions.
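
The arithmetic above can be checked with a short Python sketch. It is only an illustration, not part of any specification: the names `scalar_values`, `raw_capacity`, `valid_per_length` and `is_valid_utf8` are made up for this example, and the validity check relies on Python 3's built-in strict UTF-8 decoder rejecting overlong forms and surrogates.

```python
# 17 planes of 65536 code points, minus the 2048 surrogate code points.
PLANES = 17
PLANE_SIZE = 0x10000                    # 65536
SURROGATES = 0xDFFF - 0xD800 + 1        # 2048

scalar_values = PLANES * PLANE_SIZE - SURROGATES
print(scalar_values)                    # 1112064

# The sum from the question: raw payload capacity of each sequence length.
# It overcounts because it includes overlong forms, surrogates, and
# values above U+10FFFF.
raw_capacity = 2**7 + 2**11 + 2**16 + 2**21
print(raw_capacity)                     # 2164864

# Code points that are actually valid at each sequence length.
valid_per_length = {
    1: 0x80,                              # U+0000..U+007F
    2: 0x800 - 0x80,                      # U+0080..U+07FF
    3: 0x10000 - 0x800 - SURROGATES,      # U+0800..U+FFFF minus surrogates
    4: 0x110000 - 0x10000,                # U+10000..U+10FFFF
}
print(sum(valid_per_length.values()))   # 1112064


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\x00"))           # True: shortest form of U+0000
print(is_valid_utf8(b"\xc0\x80"))       # False: overlong 2-byte form of U+0000
print(is_valid_utf8(b"\xed\xa0\x80"))   # False: would decode to surrogate U+D800
```

The per-length breakdown shows where the difference from 2164864 comes from: each longer form only counts the code points that cannot fit in a shorter one, the 3-byte range drops the surrogates, and the 4-byte range stops at U+10FFFF.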