Search code examples
unicodeutf-8utfvariable-width

Why do so many of the possible UTF-8 encoding not correspond to a UTF-8 codepoint?


This table summarizes UTF-8's variable-width encoding scheme.

enter image description here

After noticing that not all the available 2-byte encodings are used, I extended the table as shown below, and saw that this was the case for 3-byte and 4-byte encodings as well (delta > 0).

| number of bytes | bits available for encodings | number of available encodings | number of codepoints                                | delta  = # avail encodings - # codepoints |
|-----------------|------------------------------|-------------------------------|-----------------------------------------------------|-------------------------------------------|
| 1               |          7                   |  2^7 =       128              |       0x80 -    0x00 =             8*16 =       128 |       128 -       128 =         0         |
| 2               | 5+1*6 = 11                   | 2^11 =     2,048              |     0x800 -    0x080 =    8*16^2 - 8*16 =     1,920 |     2,048 -     1,920 =       128         |
| 3               | 4+2*6 = 16                   | 2^16 =    65,536              |  0x1_0000 -    0x800 =    16^4 - 8*16^2 =    63,488 |    65,336 -    63,488 =     2,048         |
| 4               | 3+3*6 = 21                   | 2^21 = 2,097,152              | 0x11_0000 - 0x1_0000 = 0x10_0000 = 16^5 = 1,048,576 | 2,097,152 - 1,048,576 = 1,048,576         |

Is there a reason that so many encodings are unused?


Solution

  • A key feature of UTF-8 is self-synchronization. You can jump into the middle of a UTF-8 stream at any point and figure out where the next character begins. This is because many values cannot be the first byte of a character. A side effect of this is that many possible byte sequences are invalid, so the encoding is not as dense as it could be (requires more bytes than the minimum), but it makes every sequence unambiguous. This is even true if individual bytes are lost or corrupted. The decoder can always find the next character starting point and recover. This is not true of other multi-byte encoding schemes like UTF-16 or Shift-JIS which can be difficult to resynchronize if there are errors.

    The particular approach of encoding the character length in the first byte similarly reduces ambiguity in the face of dropped bytes in the stream or corruption. If you decode a start byte that indicates that this is a 3-byte character, followed by another start byte (rather than a continuation byte), the decoder can know that the previous character is corrupted. This leading length-indicator can also allow algorithms to skip over a given number of characters without decoding them.

    Again, these benefits come at a size cost. For ASCII, the size cost is zero, and it's pretty modest for Latin-based scripts. Compared to previous national systems, it imposes about a 50% cost on Asian encodings (usually 3 bytes rather than 2), and generally doubles the size of Cyrillic, Arabic, Hebrew, and Greek (generally 2 bytes rather than 1). But the benefit is that there's one system that works universally and can mix scripts, and has the above-mentioned synchronization benefits. So it's become pretty much ubiquitous despite the size cost.