unicode, character-encoding, utf-16

Total number of UTF-16 characters


Can you show, by counting permutations/combinations, that UTF-16 can represent 1,112,064 code points?


Solution

  • Section 3.9 of the Unicode Standard says:

    Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.

    Hence the number of code points ('characters') that can be represented by UTF-16 is

    0xD7FF + 1 + (0x10FFFF - 0xE000) + 1 = 1 112 064
    

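    Unicode code points range from U+0000 to U+10FFFF, so a code point needs up to 21 bits and fits directly in one 32-bit code unit (UTF-32). UTF-8 and UTF-16 use smaller code units (8 and 16 bits) so that the most common characters stay compact, and they encode the remaining code points as sequences of several code units. In UTF-16 this is done with the surrogate code points U+D800..U+DFFF: a high surrogate (U+D800..U+DBFF) followed by a low surrogate (U+DC00..U+DFFF) encodes one code point above U+FFFF. Surrogates are reserved for this purpose and cannot be encoded as characters themselves, which is why they are excluded from the count above. Counted by combinations, the 16-bit values that are not surrogates give 65 536 - 2 048 = 63 488 single-unit code points, and the pairs give 1 024 × 1 024 = 1 048 576 more, for a total of 1 112 064.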

    Also, being able to represent a given code point in a given encoding doesn't mean it is a valid character; it has to be assigned and described by the Unicode Standard first. No mathematical formula yields the number of assigned characters, so 1 112 064 is only an upper bound on what can be encoded, and it doesn't mean there are about 1.1 million valid characters.

    For a detailed discussion, see section 3.9 of the Unicode Standard.
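
The count above can be checked in a few lines of Python. This is only a sketch: the function names are made up for illustration, and the brute-force check assumes that CPython's `utf-16-le` codec rejects lone surrogates, which is the behaviour of Python 3.

```python
MAX_CODE_POINT = 0x10FFFF


def count_by_ranges() -> int:
    """Size of U+0000..U+D7FF plus size of U+E000..U+10FFFF."""
    return (0xD7FF - 0x0000 + 1) + (MAX_CODE_POINT - 0xE000 + 1)


def count_by_code_units() -> int:
    """Count the valid UTF-16 code unit sequences by combinations."""
    all_units = 0x10000                 # every possible 16-bit value
    surrogates = 0xDFFF - 0xD800 + 1    # values reserved for pairs
    single_unit = all_units - surrogates
    high = 0xDBFF - 0xD800 + 1          # possible first halves of a pair
    low = 0xDFFF - 0xDC00 + 1           # possible second halves of a pair
    return single_unit + high * low


def count_by_encoding() -> int:
    """Brute force: try to encode every code point with the UTF-16 codec."""
    encodable = 0
    for cp in range(MAX_CODE_POINT + 1):
        try:
            chr(cp).encode("utf-16-le")
        except UnicodeEncodeError:
            continue                    # lone surrogates are rejected
        encodable += 1
    return encodable


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000               # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    print(count_by_ranges(), count_by_code_units(), count_by_encoding())
    print([hex(unit) for unit in to_surrogate_pair(0x1F600)])
```

All three counts should agree, and none of them says anything about how many of those code points are actually assigned to characters.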