Tags: unicode, utf-8, character-encoding, utf

Why is it necessary to mark continuation bytes in UTF-8?


I've recently been reading up on the UTF-8 variable-width encoding, and I found it strange that UTF-8 requires the first two bits of every continuation byte to be 10.

 Range           |  Encoding
-----------------+-----------------
     0 - 7f      |  0xxxxxxx
    80 - 7ff     |  110xxxxx 10xxxxxx
   800 - ffff    |  1110xxxx 10xxxxxx 10xxxxxx
 10000 - 10ffff  |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
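
For concreteness, here is a short Python sketch of the table above. The helper name encode_utf8 is just for illustration; in practice chr(cp).encode("utf-8") already does this (and additionally rejects the surrogate range, which this sketch skips).

    def encode_utf8(cp: int) -> bytes:
        """Encode one code point per the table above (illustrative only)."""
        if cp <= 0x7F:
            return bytes([cp])                          # 0xxxxxxx
        if cp <= 0x7FF:
            return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                          0x80 | (cp & 0x3F)])          # 10xxxxxx
        if cp <= 0xFFFF:
            return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                          0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                          0x80 | (cp & 0x3F)])          # 10xxxxxx
        if cp <= 0x10FFFF:
            return bytes([0xF0 | (cp >> 18),            # 11110xxx
                          0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx
                          0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                          0x80 | (cp & 0x3F)])          # 10xxxxxx
        raise ValueError("code point out of range")

    assert encode_utf8(0x20AC) == "€".encode("utf-8")   # e2 82 ac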

I was playing around with other possible variable-width encodings, and found that the following scheme needs at most 3 bytes to store all of Unicode. If the first bit of a byte is 1, the character continues into at least one more byte (read bytes until one whose first bit is 0).

 Range           |  Encoding
-----------------+-----------------
     0 - 7f      |  0xxxxxxx
    80 - 407f    |  1xxxxxxx 0xxxxxxx
  4080 - 20407f  |  1xxxxxxx 1xxxxxxx 0xxxxxxx
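
To make the scheme concrete, here is a minimal decoding sketch. It assumes the first byte carries the most significant payload bits and that each length's range starts where the previous one ends (0x80 after one byte, 0x4080 after two), as the table implies; the function name decode_proposed is purely illustrative.

    def decode_proposed(data: bytes, i: int = 0):
        """Decode one character of the proposed scheme starting at byte index i,
        returning (code_point, next_index)."""
        payload, length = 0, 0
        while True:
            b = data[i + length]
            payload = (payload << 7) | (b & 0x7F)   # accumulate the seven x-bits
            length += 1
            if b < 0x80:                            # a leading 0 bit ends the character
                break
        offset = (0, 0x80, 0x4080)[length - 1]      # where each length's range begins
        return offset + payload, i + length

    assert decode_proposed(b"\x41") == (0x41, 1)
    assert decode_proposed(b"\x80\x00") == (0x80, 2)          # smallest 2-byte value
    assert decode_proposed(b"\xff\xff\x7f") == (0x20407F, 3)  # largest 3-byte value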

Are the continuation bits in UTF-8 really that important? The second encoding seems much more efficient.


Solution

  • UTF-8 is self-validating, fast to step forward through, and easier to step backward through (sketches of each property follow at the end of this answer).

    Self-validating: Since the first byte of a sequence specifies its length, the next X bytes must match 10xxxxxx, or the sequence is invalid. A stray 10xxxxxx byte on its own is likewise immediately recognizable as invalid.
    Your suggested encoding has no such validation built in.

    Fast to step forward: If you have to skip a character, you can immediately skip the X bytes indicated by the first byte, without examining each intermediate byte.

    Easier to step backward: If you have to read the bytes backwards, a continuation byte is immediately recognizable by its 10xxxxxx pattern, so you scan backwards past at most three such bytes until you reach the 11xxxxxx (or 0xxxxxxx) lead byte, and you never risk running past the start of the character.

    See the section "Invalid sequences and error handling" of the UTF-8 article on Wikipedia.
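
    As a rough illustration of the self-validation point, here is a Python sketch that checks only the structural rule described above; a real validator would also reject overlong encodings, surrogate code points, and values above U+10FFFF. All names are illustrative.

        def looks_like_utf8(data: bytes) -> bool:
            """Check that every lead byte is followed by exactly the number of
            10xxxxxx continuation bytes its high bits declare, and that no
            10xxxxxx byte starts a sequence."""
            i = 0
            while i < len(data):
                b = data[i]
                if b < 0x80:                 # 0xxxxxxx: 1-byte character
                    n = 0
                elif b >> 5 == 0b110:        # 110xxxxx: 1 continuation byte follows
                    n = 1
                elif b >> 4 == 0b1110:       # 1110xxxx: 2 continuation bytes follow
                    n = 2
                elif b >> 3 == 0b11110:      # 11110xxx: 3 continuation bytes follow
                    n = 3
                else:                        # stray 10xxxxxx (or 11111xxx): invalid
                    return False
                if i + 1 + n > len(data):
                    return False             # sequence truncated
                if any(data[j] >> 6 != 0b10 for j in range(i + 1, i + 1 + n)):
                    return False             # expected 10xxxxxx, got something else
                i += 1 + n
            return True

        assert looks_like_utf8("héllo".encode("utf-8"))
        assert not looks_like_utf8(b"\x80abc")      # lone continuation byte
        assert not looks_like_utf8(b"\xe2\x82")     # truncated 3-byte sequence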
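
    A sketch of the forward step: the lead byte alone says how many bytes to skip, so no intermediate byte is ever read. The lookup table and function name are illustrative.

        # Bytes per character for each possible lead byte (0 marks an invalid lead byte):
        # 00-7F -> 1, 80-BF -> invalid, C0-DF -> 2, E0-EF -> 3, F0-F7 -> 4, F8-FF -> invalid.
        UTF8_LEN = [1] * 128 + [0] * 64 + [2] * 32 + [3] * 16 + [4] * 8 + [0] * 8

        def skip_chars(data: bytes, i: int, count: int) -> int:
            """Advance byte index i past `count` characters, touching only lead bytes."""
            for _ in range(count):
                step = UTF8_LEN[data[i]]
                if step == 0:
                    raise ValueError(f"invalid lead byte at offset {i}")
                i += step
            return i

        s = "a€b".encode("utf-8")        # 61 e2 82 ac 62
        assert skip_chars(s, 0, 2) == 4  # past 'a' and '€', landing on 'b'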
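
    And the backward step: because continuation bytes identify themselves, finding the start of the character that contains a given byte never takes more than three steps back (again, an illustrative sketch):

        def char_start(data: bytes, i: int) -> int:
            """Back up from byte index i to the start of the character containing it
            by skipping over 10xxxxxx continuation bytes."""
            while data[i] & 0xC0 == 0x80:    # 10xxxxxx marks a continuation byte
                i -= 1
            return i

        s = "a€b".encode("utf-8")    # 61 e2 82 ac 62
        assert char_start(s, 3) == 1 # byte 3 (0xAC) is part of '€', which starts at byte 1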