python unicode python-unicode unicode-string

UTF16 and UTF32 decoder schema assumptions

This question relates to the Construct Python library, although that is not really important.

I am writing a piece of code that needs to parse UTF-16/UTF-32 encoded strings. There is no length prefix (as a PascalString would have), and arbitrary data follows the string. I need someone to confirm my understanding of these encodings. I know how to write a parser if these assumptions hold.

UTF-16 data must be a multiple of 2 bytes, and the last 2-byte chunk (and only the last) must be \x00\x00
UTF-32 data must be a multiple of 4 bytes, and the last 4-byte chunk (and only the last) must be \x00\x00\x00\x00

I realise that in UTF-16 some code points take more than 2 bytes.

Solution

Yes, by definition UTF-16 must come in multiples of 2 bytes and UTF-32 must come in multiples of 4 bytes.

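For UTF-32, each code point is exactly 4 bytes. For UTF-16, each code point is either 2 or 4 bytes, which is determined by the 16-bit word values: in well-formed data, words in the range 0xD800 to 0xDFFF (surrogates) occur only in 4-byte sequences, as a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF), and all other words occur only as 2-byte sequences. See the Wikipedia page on UTF-16 for details.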
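To see the 2-byte versus 4-byte split concretely, here is a minimal sketch using only the standard library. The helper names `utf16le_units` and `is_surrogate` are made up for illustration, and little-endian byte order is an assumption; swap in the big-endian codec and format if your data is big-endian.

```python
import struct

def utf16le_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16LE."""
    data = text.encode("utf-16-le")          # explicit endianness, so no BOM is added
    count = len(data) // 2                   # UTF-16 output is always a multiple of 2 bytes
    return list(struct.unpack("<%dH" % count, data))

def is_surrogate(unit):
    """True if a 16-bit code unit is one half of a 4-byte (surrogate pair) sequence."""
    return 0xD800 <= unit <= 0xDFFF

# A code point inside the Basic Multilingual Plane: one 2-byte unit, not a surrogate.
bmp_units = utf16le_units("A")

# A code point above U+FFFF: two units, both in the surrogate range.
astral_units = utf16le_units("\U0001F600")

# The same code point in UTF-32 is always exactly 4 bytes.
astral_utf32 = "\U0001F600".encode("utf-32-le")
```

Note that neither surrogate half can ever be zero, so a zero 16-bit word read on a 2-byte boundary is always code point 0 and never part of a longer sequence.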

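Code point 0 (U+0000) is not officially excluded from Unicode, so it could appear as part of a valid sequence. It is unlikely, so it's not unreasonable to use it to mark the end of a string. When searching for the terminator, step through the data in whole 2-byte (UTF-16) or 4-byte (UTF-32) units rather than searching for the raw byte pattern: zero bytes can straddle two adjacent non-zero units. For example, "A" followed by U+0100 in UTF-16LE is the bytes 41 00 00 01, which contains 00 00 at an odd offset.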
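A parser built on these assumptions might look like the sketch below. It is not Construct's implementation; `read_terminated` and its parameters are hypothetical names, and it assumes the byte order is known in advance (no BOM) and that U+0000 never occurs inside the string itself.

```python
# Code unit size in bytes for each supported codec (explicit byte order, no BOM).
UNIT_SIZES = {"utf-16-le": 2, "utf-16-be": 2, "utf-32-le": 4, "utf-32-be": 4}

def read_terminated(data, encoding="utf-16-le", offset=0):
    """Decode a null-terminated string starting at offset.

    Returns (text, next_offset), where next_offset points at the first
    byte after the terminator, i.e. the start of the data that follows.
    """
    unit = UNIT_SIZES[encoding]
    terminator = b"\x00" * unit
    pos = offset
    # Step one whole code unit at a time so an unaligned run of zero
    # bytes is never mistaken for the terminator.
    while pos + unit <= len(data):
        if data[pos:pos + unit] == terminator:
            return data[offset:pos].decode(encoding), pos + unit
        pos += unit
    raise ValueError("no terminator found before end of data")

# "A" + U+0100 encodes to 41 00 00 01, which contains 00 00 at offset 1.
payload = "A\u0100".encode("utf-16-le") + b"\x00\x00" + b"\xde\xad\xbe\xef"
text, rest_offset = read_terminated(payload, "utf-16-le")
trailing = payload[rest_offset:]
```

A plain `payload.find(b"\x00\x00")` would stop at offset 1 here and cut the string in the middle of a code unit, which is why the loop advances by the unit size instead.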