unicode encoding utf-8 character-encoding

Can anyone show me the calculation for how UTF-8 represents 1112064 characters?

I am not understanding how UTF-8 represents 1112064 characters.

My calculation is something like this: 2⁷ + 2¹¹ + 2¹⁶ + 2²¹ = 2164864 characters.

A UTF-8 sequence carries 7 payload bits when it is 1 byte long, 11 bits when it is 2 bytes, 16 bits when it is 3 bytes, and 21 bits when it is 4 bytes.

Does the number 1112064 exclude emojis?

Solution

1112064 is the number of valid Unicode code points (Unicode scalar values). The code space consists of 17 planes of 65536 code points each, U+NN0000..U+NNFFFF, where NN runs from 0x00 (the BMP, or Basic Multilingual Plane) through 0x10. From that total you subtract the 2048 code points U+D800..U+DFFF, which are reserved for surrogates in the UTF-16 encoding.

17 × 65536 - 2048 = 1114112 - 2048 = 1112064
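
As a rough cross-check of that arithmetic, here is a small Python sketch. The names `PLANES`, `PLANE_SIZE`, `SURROGATES`, and `encodable` are made up for illustration, and the brute-force loop relies on CPython's strict `utf-8` codec refusing to encode surrogates.

```python
# Arithmetic from the answer: 17 planes of 65536 code points, minus surrogates.
PLANES = 17
PLANE_SIZE = 0x10000                   # 65536
SURROGATES = range(0xD800, 0xE000)     # U+D800..U+DFFF, 2048 code points

total_code_points = PLANES * PLANE_SIZE
scalar_values = total_code_points - len(SURROGATES)

# Brute-force check: count the code points that Python will encode as UTF-8.
encodable = 0
for cp in range(total_code_points):
    try:
        chr(cp).encode("utf-8")
        encodable += 1
    except UnicodeEncodeError:
        # Raised for surrogates, which strict UTF-8 does not allow.
        pass

print(total_code_points, scalar_values, encodable)
```

Both counts should agree, since the only code points the strict codec rejects are the surrogates.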

UTF-8's bit layout can represent more values than that, but the specification restricts valid UTF-8 to valid Unicode code points, each encoded in its shortest form. For example, U+0000 could be written as the single byte 0x00 or as the 2-byte sequence 0xC0 0x80, but the latter is invalid, as are the 3-byte and longer versions.
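
To make the difference between raw capacity and valid encodings concrete, the sketch below does two things. It counts how many code points have each shortest encoded length, and it shows a strict decoder (here Python's built-in `utf-8` codec) rejecting overlong forms. The helper `is_valid_utf8` and the variable names are hypothetical, chosen only for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Number of code points whose shortest UTF-8 form has each length.
shortest_form_counts = {
    1: 0x80,                        # U+0000..U+007F
    2: 0x800 - 0x80,                # U+0080..U+07FF
    3: 0x10000 - 0x800 - 2048,      # U+0800..U+FFFF, minus the surrogates
    4: 0x110000 - 0x10000,          # U+10000..U+10FFFF
}
valid_total = sum(shortest_form_counts.values())

# Raw capacity of the bit layouts from the question: 2^7 + 2^11 + 2^16 + 2^21.
raw_capacity = 2**7 + 2**11 + 2**16 + 2**21

shortest_nul = b"\x00"              # U+0000 in its shortest form
overlong_nul_2 = b"\xc0\x80"        # U+0000 padded to 2 bytes: invalid
overlong_nul_3 = b"\xe0\x80\x80"    # U+0000 padded to 3 bytes: invalid
encoded_surrogate = b"\xed\xa0\x80" # would be U+D800: invalid
beyond_max = b"\xf4\x90\x80\x80"    # would be U+110000: invalid

print(shortest_form_counts, valid_total, raw_capacity)
```

The gap between `raw_capacity` and `valid_total` comes from the overlong forms, the surrogates, and the 4-byte values above U+10FFFF, all of which a conforming decoder must reject.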