Search code examples
serializationutf-8string-lengthvariable-length

Does utf-8 encoding need character length?


From what I understand, casual binary serialization implementations of non-static structures (like an array or a vector) would usually state the structure's "length" as the first word (usually a 64bit uint), then proceed to encode each entity's value, without separators (given the serialized subject data in each cell of the array is deterministic, so the binary parser doesn't need any lookahead or backtracking).

Would this behavior be the same, traditionally, for utf-8 strings? I can't see any other way for implementing a binary serialization for "unbounded" utf-8 strings, such that the parser wouldn't need backtracking (which can be really inefficient) or lookahead (which would also need excessive testing against various possibilities, also inefficient). My guess is that the "length" value would denote the number of characters, not the number of bytes, as the utf-8 encoding ranges from 1 to 4 bytes for each character, although the encoding itself denotes how many bytes exist in the character based on the first byte (eliminating backtracking and lookahead, per-character).

As an example, the octet stream for the string abc would be

[0,0,0,0,0,0,0,3,97,98,99]

where 0,0,0,0,0,0,0,3 denotes the uint64 length of the input string, abc.

Is my intuition correct, or is there something I'm missing?


Solution

  • In UTF-8, the Unicode code point U+0000 (NUL) is encoded as a single byte of value zero. It does not occur in the encoding of any other code point in UTF-8, so a null-terminated byte string can be used without a preceding length as long as embedded NUL is not allowed in the sequence; Otherwise, a preceding length can also be used as you have shown in the question.

    For example, the Unicode string "abcdéfg一二三四" is encoding as the hexadecimal bytes:

    61 62 63 64 c3 a9 66 67 e4 b8 80 e4 ba 8c e4 b8 89 e5 9b 9b 00
    a  b  c  d  é     f  g  一       二       三       四        ␀
    

    UTF-8 doesn't need backtracking or lookahead since the lead byte of a sequence indicates the number of trailing bytes required for the code point:

    61hex = 01100001bin (one-byte sequence)
    c3hex = 11000011bin (two-byte sequence)
    e4hex = 11100100bin (three-byte sequence)

    Trailing bytes all begin with 10xxxxxxbin:

    a9hex = 10101001bin (trailing byte)
    b8hex = 10111000bin (trailing byte)
    80hex = 10000000bin (trailing byte)