
UTF-8 encoding why prefix 10?


As far as I know UTF-8 is a variable-length encoding, i.e. a character can be represented as 1 byte, 2 bytes, 3 bytes or 4 bytes.

For example the Unicode character U+00A9 = 10101001 is encoded in UTF-8 as

11000010 10101001, i.e. 0xC2 0xA9

The prefix 110 in the first byte indicates that the character is encoded with two bytes (the number of leading ones before the zero gives the length of the sequence).

The prefix in the following bytes starts with 10

A 4-byte UTF-8 encoding would look like

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The prefix 11110 (four ones, then a zero) indicates four bytes, and so on.
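These patterns can be checked against Python's built-in UTF-8 encoder (a quick sanity check, not part of the original question):

```python
# U+00A9 (©) should encode as 11000010 10101001, i.e. 0xC2 0xA9.
print([f"{b:08b}" for b in "\u00A9".encode("utf-8")])

# A 4-byte example: U+1F600 (😀) uses the 11110xxx 10xxxxxx ... pattern.
print([f"{b:08b}" for b in "\U0001F600".encode("utf-8")])
```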

Now my question:

Why is the prefix 10 used in the following bytes? What is the advantage of such a prefix? Without the 10 prefix in the following bytes I could use 3×2 = 6 more bits by writing:

11110000 xxxxxxxx xxxxxxxx xxxxxxxx


Solution

  • Historically there were many proposals for UTF-8's encoding, one of which uses no prefix in the following bytes, and another, named FSS-UTF, which uses the prefix 1:

    Bytes  First       Last        Byte 1    Byte 2    Byte 3    Byte 4    Byte 5
           code point  code point
    1      U+0000      U+007F      0xxxxxxx
    2      U+0080      U+207F      10xxxxxx  1xxxxxxx
    3      U+2080      U+8207F     110xxxxx  1xxxxxxx  1xxxxxxx
    4      U+82080     U+208207F   1110xxxx  1xxxxxxx  1xxxxxxx  1xxxxxxx
    5      U+2082080   U+7FFFFFFF  11110xxx  1xxxxxxx  1xxxxxxx  1xxxxxxx  1xxxxxxx


    However, in the end an encoding using the prefix 10 was chosen:

    A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it somewhat less bit-efficient than the previous proposal but crucially allowed it to be self-synchronizing, letting a reader start anywhere and immediately detect byte sequence boundaries.

    https://en.wikipedia.org/wiki/UTF-8#History

    The most obvious advantage of the new encoding is self-synchronization, as others have mentioned. It lets the reader find character boundaries easily, so a dropped or invalid byte (for example over a choppy network connection, or in a corrupted file) can be skipped quickly without destroying the rest of the content, and the current/previous/next character can be found immediately given any byte index in the string. If the indexed byte starts with 10 then it's just a middle byte, so you move backward or forward to find the starts of the surrounding characters; if it starts with 0 or 11 then it's the start of a byte sequence.
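    A minimal sketch of that boundary search in Python (the helper name `char_start` is mine, not from any library):

```python
def char_start(buf: bytes, i: int) -> int:
    """Back up from an arbitrary byte index i to the start of the
    character containing it. Continuation bytes match 10xxxxxx,
    i.e. (b & 0xC0) == 0x80."""
    while (buf[i] & 0xC0) == 0x80:   # 10xxxxxx → middle of a sequence
        i -= 1                       # step back toward the lead byte
    return i

data = "héllo".encode("utf-8")       # é encodes as two bytes, 0xC3 0xA9
print(char_start(data, 2))           # byte 2 is é's continuation byte → 1
```

    Note that the loop never has to decode anything: a single bit test per byte is enough to classify it, which is exactly the property the 10 prefix buys.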

    That property is very important, because in a badly designed encoding without self-synchronization, like Shift-JIS, the reader has to maintain a table of character offsets, or else reparse the array from the beginning in order to edit a string. A lost or wrong byte also causes the whole content from that byte to the end to become unreadable. In DOS/V for Japanese (which uses Shift-JIS), probably due to the limited amount of memory, no such table was used, so every time you pressed Backspace the OS had to re-iterate from the start to know which character was deleted. There's no way to get the length of the previous character as there is in UTF-8.

    The prefixed nature of UTF-8 also allows old C string search functions to work without any modification, because a search string's byte sequence can never appear in the middle of another valid UTF-8 byte sequence. In Shift-JIS or other non-self-synchronizing encodings you need a specialized search function, because a start byte can also be a middle byte of another character.
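    The search property is easy to demonstrate with a plain byte-level find (a sketch, using Python's `bytes.find` as a stand-in for a naive C `strstr`):

```python
hay = "naïve café".encode("utf-8")
needle = "é".encode("utf-8")   # 0xC3 0xA9

# bytes.find knows nothing about UTF-8, yet it can only ever match at a
# real character boundary: no valid sequence is a substring of the
# middle of another valid sequence.
print(hay.find(needle))        # byte offset of é's lead byte → 10
```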

    Some of the above advantages are also shared by UTF-16:

    Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units (i.e. the type of code unit can be determined by the ranges of values in which it falls). UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string (UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte).

    https://en.wikipedia.org/wiki/UTF-16#Description