utf-8 ascii whitespace

Detect ASCII whitespace in a UTF-8 stream


Is it safe to use

ch >= '\0' && ch <= ' '

as a condition that detects ASCII whitespace? (I am ignoring characters like non-breaking space.)

I am thinking of sequences like 0x8? 0x20, where the 0x20 would be treated as whitespace even though the first byte indicates that the sequence has not ended.


Solution

  • Every byte of a multi-byte UTF-8 sequence, lead byte and continuation bytes alike, has its highest bit set (it is 0x80 or above), so no byte in the range 0x00 - 0x20 can be part of such a sequence. The only bytes that do not have the highest bit set are the stand-alone bytes that represent the 128 characters of US-ASCII.

    Therefore, it is safe.
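
The reasoning above can be checked with a minimal sketch in C. The function names here are hypothetical, not from the question, and the byte is taken as unsigned char so the result does not depend on whether plain char is signed on a given platform. Note that the range 0x00 - 0x20 also covers the ASCII control characters, not only space, tab and the line breaks, just like the condition in the question.

```c
#include <stdbool.h>
#include <stddef.h>

/* True for bytes 0x00..0x20, the same range as the condition in the
   question. Bytes of multi-byte UTF-8 sequences are all >= 0x80, so
   they can never satisfy this test. */
static bool is_space_or_control(unsigned char b)
{
    return b <= 0x20;
}

/* Hypothetical helper: counts the bytes in buf[0..len) that fall in
   the 0x00..0x20 range. */
static size_t count_space_or_control(const char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        /* Cast first: plain char may be signed, which would make
           bytes >= 0x80 negative. */
        if (is_space_or_control((unsigned char)buf[i]))
            n++;
    }
    return n;
}
```

For example, scanning the seven bytes C3 A9 20 E2 82 AC 0A (an é, a space, a euro sign and a newline) should count only the 0x20 and the 0x0A; none of the lead or continuation bytes match.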