Tags: utf-8, character-encoding, bitflags

Why are only 128 characters encoded to one byte in UTF-8 and not 256 characters?


I'm trying to grasp UTF-8 encoding. The encoding of a code point can range from 1 to 4 bytes. Only 128 characters are encoded with one byte, yet a byte has 8 bits and could therefore encode 256 (= 2⁸) characters.

Therefore, why are only 128 characters encoded to one byte in UTF-8 and not 256? I have read the Wikipedia article.


Solution

  • One bit is used to indicate a multi-byte sequence. Taking that one bit away from the 8 leaves 2⁷ (128) possible one-byte values.

    The number of leading 1 bits in a sequence indicates the length of the sequence. A leading 0 bit indicates a one-byte sequence, but that 0 means there are only 7 bits left for the code point data.

    A leading 110 indicates a two-byte sequence, and continuation bytes begin with 10. So a 2-byte sequence loses 3 prefix bits in the lead byte and 2 in the continuation byte, and can encode 11 (16 − 3 − 2) bits.

    A leading 1110 indicates a three-byte sequence, which encodes 16 (24 − 4 − 2·2) bits.

    And a leading 11110 indicates a four-byte sequence, encoding 21 (32 − 5 − 2·3) bits. Unicode code points are defined as 21 bits, so that is enough. UTF-8 originally supported sequences of up to 6 bytes, but it was restricted to 4 bytes when Unicode was limited to 21 bits (to stay compatible with UTF-16).

    You may notice that the one-byte and continuation prefixes are "backwards." To be consistent with the rule that the number of leading 1 bits gives the sequence length, single-byte sequences would begin with 10 and continuation bytes with 0. But that would break ASCII compatibility (which is a huge advantage of UTF-8) and would also reduce the number of 1-byte encodings to 64, which would be extremely inefficient. So a small inconsistency is accepted for a great advantage. The sketches below illustrate both points.
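
To make the byte prefixes concrete, here is a minimal Python sketch (not part of the original answer; the sample characters are arbitrary picks, one for each length class) that prints the UTF-8 bytes of a 1-, 2-, 3- and 4-byte character in binary:

```python
# Print the UTF-8 encoding of characters that need 1, 2, 3 and 4 bytes,
# so the 0 / 110 / 1110 / 11110 lead prefixes and the 10 continuation
# prefix are visible. The sample characters are arbitrary choices.
samples = [
    ("A", "U+0041,  1 byte,   7 payload bits"),
    ("é", "U+00E9,  2 bytes, 11 payload bits"),
    ("€", "U+20AC,  3 bytes, 16 payload bits"),
    ("𐍈", "U+10348, 4 bytes, 21 payload bits"),
]

for ch, note in samples:
    byte_bits = " ".join(f"{b:08b}" for b in ch.encode("utf-8"))
    print(f"{note}: {byte_bits}")
```

Running it shows, for example, 'A' as 01000001 (a single leading 0) and '€' as 11100010 10000010 10101100 (a 1110 lead byte followed by two 10 continuation bytes).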
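
And a small sketch of the trade-off described in the last paragraph (the 10-prefixed single byte here is the hypothetical "consistent" layout discussed above, not anything UTF-8 actually does):

```python
# ASCII text encodes to exactly the same bytes in UTF-8, thanks to the
# 0-prefix design for one-byte sequences.
text = "Hello, UTF-8!"
assert text.encode("utf-8") == text.encode("ascii")

# One-byte characters available under each layout:
print(2 ** 7, "with the real 0 prefix (7 payload bits)")         # 128
print(2 ** 6, "with a hypothetical 10 prefix (6 payload bits)")  # 64
```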