Tags: c++, string, encoding, utf-8, split

Split a UTF-8 encoded string on blank characters without knowing about UTF-8 encoding


I would like to split a string at every blank character (' ', '\n', '\r', '\t', '\v', '\f'). The string is stored in UTF-8 encoding in a byte array (a char*, a vector, or a string, for instance).

Can I just split the byte array at each splitting character? Put another way, can I be sure that the byte values corresponding to these characters never occur inside a multi-byte character? From reading the UTF-8 spec, it seems that every byte of a multi-byte character has a value of 128 or higher.

Thanks


Solution

  • Yes, you can.

    Multibyte sequences necessarily include one lead byte (two MSBs equal to 11) and one or more continuation bytes (two MSBs equal to 10). Every byte of a multibyte sequence therefore has its high bit set (value 0x80 or above), while all six blank characters you list are ASCII (below 0x80), so a blank byte can never appear inside a multibyte character. The total length of the sequence (lead byte plus continuation bytes) equals the number of MSBs set to 1 in the lead byte before the first 0 bit (e.g. if the lead byte is 110xxxxx, exactly one continuation byte should follow; if it is 11110xxx, there should be exactly three continuation bytes).
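
    The bit patterns above can be checked with a few masks. This is only a minimal sketch: the helper names `is_ascii`, `is_continuation` and `sequence_length` are made up for illustration, and the code does not reject overlong encodings or lead bytes that can never occur in valid UTF-8.

    ```cpp
    #include <cstddef>

    // Hypothetical helpers, named for illustration only.

    // True for bytes 0x00-0x7F, i.e. plain ASCII (high bit clear).
    inline bool is_ascii(unsigned char b) { return (b & 0x80) == 0x00; }

    // True for continuation bytes of the form 10xxxxxx.
    inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

    // Total sequence length implied by the first byte:
    // 1 for ASCII, 2 to 4 for a lead byte, 0 if the byte cannot start a sequence.
    inline std::size_t sequence_length(unsigned char b) {
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return 0;                          // continuation byte or invalid lead
    }
    ```

    Since every blank in the question satisfies `is_ascii`, and no byte of a multibyte sequence does, the two sets of byte values cannot overlap.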

    So, if you find truncated multibyte sequences or stray continuation bytes without a lead byte, your string was already invalid before you touched it, and your split procedure won't damage it any further.
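
    A byte-wise split along these lines might look as follows. It is a sketch under assumptions, not a definitive implementation: the names `is_ascii_blank` and `split_on_blanks` are invented here, the input is taken as a `std::string`, and runs of consecutive blanks are collapsed so that no empty tokens are produced (you may want different behaviour).

    ```cpp
    #include <string>
    #include <vector>

    // Hypothetical helper: the six ASCII blanks listed in the question.
    inline bool is_ascii_blank(char c) {
        return c == ' ' || c == '\n' || c == '\r' ||
               c == '\t' || c == '\v' || c == '\f';
    }

    // Splits a UTF-8 byte string on ASCII blanks without decoding it.
    // Multibyte characters are copied through untouched, byte by byte.
    std::vector<std::string> split_on_blanks(const std::string& utf8) {
        std::vector<std::string> tokens;
        std::string current;
        for (char c : utf8) {
            if (is_ascii_blank(c)) {
                // End of a token; skip empty ones caused by repeated blanks.
                if (!current.empty()) {
                    tokens.push_back(current);
                    current.clear();
                }
            } else {
                current.push_back(c);
            }
        }
        if (!current.empty()) tokens.push_back(current);
        return tokens;
    }
    ```

    The function never inspects the high bit at all, which is the point: valid multibyte sequences pass through intact, and invalid input comes out exactly as invalid as it went in.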

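    One thing to note, though: Unicode defines other “blank” characters outside the ASCII range, such as U+00A0 NO-BREAK SPACE, U+2003 EM SPACE and U+3000 IDEOGRAPHIC SPACE. These are encoded as multibyte sequences, so a split on the six ASCII bytes will not catch them. You might want to handle them as well.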
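
    If you do want to treat some non-ASCII blanks as separators, one option is to match their UTF-8 byte sequences directly. The sketch below is illustrative only: `blank_length_at` is a made-up name, and the table holds a small hand-picked subset of Unicode spaces, not the full list.

    ```cpp
    #include <cstddef>
    #include <string>

    // Returns the byte length of the blank starting at s[i], or 0 if there
    // is none. Covers the six ASCII blanks plus a small, NON-exhaustive
    // sample of non-ASCII Unicode spaces (an assumption for illustration).
    std::size_t blank_length_at(const std::string& s, std::size_t i) {
        static const char* const kWideBlanks[] = {
            "\xC2\xA0",      // U+00A0 NO-BREAK SPACE
            "\xE2\x80\x83",  // U+2003 EM SPACE
            "\xE2\x80\xA8",  // U+2028 LINE SEPARATOR
            "\xE3\x80\x80",  // U+3000 IDEOGRAPHIC SPACE
        };
        if (i >= s.size()) return 0;
        const char c = s[i];
        if (c == ' ' || c == '\n' || c == '\r' ||
            c == '\t' || c == '\v' || c == '\f') {
            return 1;
        }
        for (const char* blank : kWideBlanks) {
            const std::string pattern(blank);
            // compare() clamps to the end of s, so a truncated match fails.
            if (s.compare(i, pattern.size(), pattern) == 0) {
                return pattern.size();
            }
        }
        return 0;
    }
    ```

    A splitting loop would call this at each position and, on a non-zero result, close the current token and skip that many bytes.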