This question relates to the Construct Python library, although that is not really important.
I am writing a piece of code that needs to parse UTF-16/UTF-32 encoded strings. There is no length prefix (unlike a PascalString), and arbitrary data follows the string. I need someone to confirm my understanding of these encodings. I know how to write a parser if these assumptions hold.
I realise that some codepoints are not necessarily 2 bytes (UTF16).
Yes, by definition UTF-16 must come in multiples of 2 bytes and UTF-32 must come in multiples of 4 bytes.
For UTF-32, each codepoint is exactly 4 bytes. For UTF-16, each codepoint is either 2 or 4 bytes, which is determined by the 16-bit code unit values - 0xd800 to 0xdfff (the surrogates) will only occur in 4 byte sequences, as a high surrogate (0xd800 to 0xdbff) followed by a low surrogate (0xdc00 to 0xdfff), and the rest will only occur in 2 byte sequences. See the Wikipedia page on UTF-16 for details.
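As a rough illustration of the 2-or-4-byte rule, here is a minimal Python sketch. The function name `utf16_codepoint_size` and its `little_endian` parameter are made up for this example (they are not part of Construct), and it assumes you already know the byte order since there is no BOM.

```python
import struct

def utf16_codepoint_size(data: bytes, offset: int = 0, little_endian: bool = True) -> int:
    """Return the size in bytes (2 or 4) of the UTF-16 codepoint starting at offset."""
    fmt = "<H" if little_endian else ">H"
    # Read the first 16-bit code unit.
    (unit,) = struct.unpack_from(fmt, data, offset)
    if 0xD800 <= unit <= 0xDBFF:
        # High surrogate: a low surrogate must follow, making 4 bytes in total.
        if offset + 4 > len(data):
            raise ValueError("truncated surrogate pair")
        (low,) = struct.unpack_from(fmt, data, offset + 2)
        if not 0xDC00 <= low <= 0xDFFF:
            raise ValueError("high surrogate not followed by low surrogate")
        return 4
    if 0xDC00 <= unit <= 0xDFFF:
        # A low surrogate may never start a codepoint.
        raise ValueError("unpaired low surrogate")
    # Anything else is a complete codepoint on its own.
    return 2
```

For instance, `utf16_codepoint_size("A".encode("utf-16-le"))` gives 2, while the same call on `"\U0001F600".encode("utf-16-le")` gives 4.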
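Codepoint 0 is not officially excluded from Unicode, so it could appear as part of a valid sequence. That is unlikely in real text, so it's not unreasonable to use it to mark the end of a string. The terminator is then a whole zero code unit: 2 zero bytes for UTF-16, 4 zero bytes for UTF-32.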
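If you do use codepoint 0 as the terminator, one possible approach is to scan in steps of the code unit size, so that zero bytes straddling two neighbouring units are not mistaken for the terminator. This is only a sketch under that assumption; `read_terminated` is a hypothetical helper, not a Construct API.

```python
def read_terminated(data: bytes, unit_size: int, encoding: str):
    """Return (decoded string, bytes consumed including the terminator).

    unit_size is 2 for UTF-16 and 4 for UTF-32; encoding is an explicit
    byte-order codec name such as "utf-16-le" or "utf-32-be".
    """
    terminator = b"\x00" * unit_size
    # Step through aligned code units only; an unaligned search could
    # match zero bytes that belong to two different code units.
    for offset in range(0, len(data) - unit_size + 1, unit_size):
        if data[offset:offset + unit_size] == terminator:
            return data[:offset].decode(encoding), offset + unit_size
    raise ValueError("no terminator found")
```

With `data = "a\u0100".encode("utf-16-le") + b"\x00\x00" + b"rest"`, the bytes are `61 00 00 01 00 00 ...`, so a plain `data.find(b"\x00\x00")` would stop at offset 1, whereas `read_terminated(data, 2, "utf-16-le")` returns the two-character string and 6 bytes consumed. The arbitrary data that follows starts at that returned offset.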