python, utf-8

How does Python 3's decode() know how to delimit the code points?


How does the Python 3 decode() function know where one character's bytes end and the next begin, given that a byte string contains no delimiter? Or do b-strings have delimiters under the hood?

I am asking because a UTF-8 encoded character may be anywhere from 1 to 4 bytes long.


Solution

  • No delimiters are needed: the high bits of the first byte of a character indicate how many bytes that character occupies. See the UTF-8 article on Wikipedia. Basically, if the first byte is

    • 0xxxxxxx: it is a 1-byte character
    • 110xxxxx: it is a 2-byte character
    • 1110xxxx: it is a 3-byte character
    • 11110xxx: it is a 4-byte character
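
The rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not how CPython's decoder is actually implemented; the helper names `utf8_sequence_length` and `split_utf8` are made up for this example. It also assumes well-formed input: a real decoder additionally checks that every following byte has the form 10xxxxxx (a continuation byte) and rejects overlong or out-of-range sequences.

```python
from typing import List


def utf8_sequence_length(first_byte: int) -> int:
    """Return the length in bytes of the UTF-8 sequence that starts with first_byte.

    Hypothetical helper for illustration; it only inspects the leading bits.
    """
    if first_byte & 0b10000000 == 0b00000000:  # 0xxxxxxx
        return 1
    if first_byte & 0b11100000 == 0b11000000:  # 110xxxxx
        return 2
    if first_byte & 0b11110000 == 0b11100000:  # 1110xxxx
        return 3
    if first_byte & 0b11111000 == 0b11110000:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte and 11111xxx is never valid as a lead byte.
    raise ValueError(f"invalid UTF-8 lead byte: {first_byte:#04x}")


def split_utf8(data: bytes) -> List[bytes]:
    """Split a well-formed UTF-8 byte string into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])  # indexing bytes yields an int
        chunks.append(data[i:i + n])
        i += n
    return chunks


text = "aé€😀"  # characters that take 1, 2, 3 and 4 bytes respectively
chunks = split_utf8(text.encode("utf-8"))
print([len(c) for c in chunks])
print([c.decode("utf-8") for c in chunks])
```

Each chunk decodes on its own, which shows that the byte string carries enough information to be split without any separator.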