Search code examples
stringutf-8character-encodingascii

What's the advantage of UTF-8 over highest-bit encoding such as ULEB128 or LLVM bitcode?


Highest-bit encoding means

if a byte starts with 0 it's the final byte, if a byte starts with 1 it's not the final byte

notice, this is also ASCII compatible, in fact, this is what LLVM bitcode does.


Solution

  • The key property that UTF-8 has which ULEB128 lacks, is that no UTF-8 character encoding is a substring of any other UTF-8 character. Why does this matter? It allows UTF-8 to satisfy an absolutely crucial design criterion: you can apply C string functions that work on single-byte encodings like ASCII and Latin-1 to UTF-8 and they will work correctly. This is not the case for ULEB128, as we'll see. That in turn means that most programs written to work with ASCII or Latin-1 will also "just work" when passed UTF-8 data, which was a huge benefit for the adoption of UTF-8 on older systems.

    First, here's a brief primer on how ULEB128 encoding works: each code point is encoded as a sequence of bytes terminated by a byte without the high bit set—we'll call that a "low byte" (values 0-127); each leading byte has the high bit set—we'll call that a "high byte" (values 128-255). Since ASCII characters are 0-127, they are all low bytes and thus are encoded the same way in ULEB128 (and Latin-1 and UTF-8) as in ASCII: i.e. any ASCII string is also a valid ULEB128 string (and Latin-1 and UTF-8), encoding the same sequence of code points.

    Note the subcharacter property: any ULEB128 character encoding can be prefixed with any of the 127 high byte values to encode longer, higher code points. This cannot occur with UTF-8 because the lead byte encodes how many following bytes there are, so no UTF-8 character can be contained in any other.

    Why does this property matter? One particularly bad case of this is that the ULEB128 encoding of any character whose code point is divisible by 128 will contain a trailing NUL byte (0x00). For example, the character Ā (capital "A" with bar over it) has code point U+0100 which would be encoded in ULEB128 as the bytes 0x82 and 0x00. Since many C string functions interpret the NUL byte as a string terminator, they would incorrectly interpret the second byte to be the end of the string and interpret the 0x82 byte as an invalid dangling high byte. In UTF-8 on the other hand, a NUL byte only ever occurs in the encoding of the NUL byte code point, U+00, so this issue cannot occur.

    Similarly, if you use strchr(str, c) to search a ULEB128-encoded string for an ASCII character, you can get false positives. Here's an example. Suppose you have the following very Unicode file path: 🌯/☯. This is presumably a file in your collection of burrito recipes for the burrito you call "The Yin & Yang". This path name has the following sequence of code points:

    • 🌯 — Unicode U+1F32F
    • / — ASCII/Unicode U+2F
    • — Unicode U+262F

    In ULEB128 this would be encoded with the following sequence of bytes:

    • 10000111 – leading byte of 🌯 (U+1F32F)
    • 11100110 – leading byte of 🌯 (U+1F32F)
    • 00101111 – final byte of 🌯 (U+1F32F)
    • 00101111 – only byte of / (U+2F)
    • 11001100 – leading byte of (U+262F)
    • 00101111 – final byte of (U+262F)

    If you use strchr(path, '/') to search for slashes to split this path into path components, you'll get bogus matches for the last byte in 🌯 and the last byte in , so to the C code it would look like this path consists of an invalid directory name \x87\xE6 followed by two slashes, followed by another invalid directory name \xCC and a final trailing slash. Compare this with how this path is encoded in UTF-8:

    • 11110000 — leading byte of 🌯 (U+1F32F)
    • 10011111 — continuation byte of 🌯 (U+1F32F)
    • 10001100 — continuation byte of 🌯 (U+1F32F)
    • 10101111 — continuation byte of 🌯 (U+1F32F)
    • 00101111 — only byte of / (U+2F)
    • 11110000 — leading byte of (U+262F)
    • 10011111 — continuation byte of (U+262F)
    • 10010000 — continuation byte of (U+262F)
    • 10101111 — continuation byte of (U+262F)

    The encoding of / is 0x2F which doesn't—and cannot—occur anywhere else in the string since the character does not occur anywhere else—the last bytes of 🌯 and are 0xAF rather than 0x2F. So using strchr to search for / in the UTF-8 encoding of the path is guaranteed to find the slash and only the slash.

    The situation is similar with using strstr(haystack, needle) to search for substrings: using ULEB128 this would find actual occurrences of needle, but it would also find occurrences of needle where the first character is replaced by any character whose encoding extends the actual first character. Using UTF-8, on the other hand, there cannot be false positives—every place where the encoding of needle occurs in haystack it can only encode needle and cannot be a fragment of the encoding of some other string.

    There are other pleasant properties that UTF-8 has—many as a corollary of the subcharacter property—but I believe this is the really crucial one. A few of these other properties:

    • You can predict how many bytes you need to read in order to decode a UTF-8 character by looking at its first byte. With ULEB128, you don't know you're done reading a character until you've read the last byte.

    • If you embed a valid UTF-8 string in any data, valid or invalid, it cannot affect the interpretation of the valid substring. This is a generalization of property that no character is a subsequence of any other character.

    • If you start reading a UTF-8 stream in the middle and you can't look backwards, you can tell if you're at the start of a character or in the middle of one. With ULEB128, you might decode the tail end of a character as a different character that looks valid but is incorrect, whereas with UTF-8 you know to skip ahead to the start of a whole character.

    • If you start reading a UTF-8 stream in the middle and you want to skip backwards to the start of the current character, you can do so without reading past the start of the character. With ULEB128 you'd have read one byte before the start of a character (if possible) to know where the character starts.

    • To address the previous two points you could use a little-endian variant of ULEB128 where you encode the bits of each code point in groups of seven from least to most significant. Then the first byte of each character's encoding is a low byte and the trailing bytes are high bytes. However, this encoding sacrifices another pleasant property of UTF-8, which is that if you sort an encoded strings lexicographically by bytes, they are sorted into the same order as if you had sorted them lexicographically by characters.

    An interesting question is whether UTF-8 is inevitable given this set of pleasant properties? I suspect that some variation is possible, but not very much.

    • You want to anchor either the start or end of the character with some marker that cannot occur anywhere else. In UTF-8 this is a byte with 0, 2, 3 or 4 leading ones, which indicates the start of a character.

    • You want to encode the length of a character somehow, not just an indicator of when a character is done, since this prevents extending one valid character to another one. In UTF-8 the number of leading bits in the leading byte indicates how many bytes are in the character.

    • You want the start/end marker and the length encoding at the beginning so that you know in the middle of a stream that if you're at the start of a character or not and so that you know how many bytes you need to complete the character.

    There are probably some variations that can satisfy all of this and the lexicographical ordering property, but with all these constraints UTF-8 starts to seem pretty inevitable.