
How to read UTF16-BE encoded bytes with length header


I want to decode a series of variable-length strings encoded in UTF-16-BE. Each string is preceded by a two-byte big-endian integer giving half the byte length of the string that follows, i.e. the number of 16-bit code units. For example:

Length    String (encoded)           Length    String (encoded)               ...
\x00\x05  \x00H\x00e\x00l\x00l\x00o  \x00\x06  \x00W\x00o\x00r\x00l\x00d\x00! ...

All these strings and their length headers are concatenated in one big bytestring.
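To make the framing concrete, here is a minimal sketch of how such a byte string could be produced. The helper name `encode_strings` is hypothetical and only illustrates the layout described above; it assumes every string fits in a 16-bit code-unit count.

```python
import struct

def encode_strings(strings):
    """Hypothetical helper: frame each string as <2-byte big-endian count><UTF-16-BE payload>."""
    chunks = []
    for text in strings:
        payload = text.encode("utf_16_be")
        # The header stores the number of 16-bit code units,
        # which is half the payload's length in bytes.
        chunks.append(struct.pack(">H", len(payload) // 2))
        chunks.append(payload)
    return b"".join(chunks)

data = encode_strings(["Hello", "World!"])
```

With the two sample strings, `data` starts with `\x00\x05` followed by the 10 payload bytes of "Hello".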

I have the encoded byte string as a bytes object in memory. I would like a generator function that yields the decoded strings until it reaches the end of the byte string.


Solution

  • Not a huge improvement, but your code can be streamlined a bit.

    import io
    from typing import Iterator

    def decode_strings(byte_string: bytes) -> Iterator[str]:
        with io.BytesIO(byte_string) as stream:
            while (header := stream.read(2)):
                # The header counts 16-bit code units, so the payload is twice as many bytes.
                length = int.from_bytes(header, byteorder="big")
                yield stream.read(2 * length).decode("utf_16_be")
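If truncated input is a concern, a variant that walks the buffer by offset and checks bounds may be useful. This is only a sketch: the name `iter_utf16be_strings` and the choice to raise `ValueError` on truncation are assumptions, not part of the question.

```python
import struct
from typing import Iterator

def iter_utf16be_strings(data: bytes) -> Iterator[str]:
    """Hypothetical variant: yield each length-prefixed UTF-16-BE string in data."""
    offset = 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated length header")
        # ">H" reads one unsigned 16-bit big-endian integer.
        (units,) = struct.unpack_from(">H", data, offset)
        offset += 2
        # The header counts 16-bit code units, so the payload is 2 * units bytes.
        end = offset + 2 * units
        if end > len(data):
            raise ValueError("truncated string payload")
        yield data[offset:end].decode("utf_16_be")
        offset = end

sample = b"\x00\x05\x00H\x00e\x00l\x00l\x00o\x00\x06\x00W\x00o\x00r\x00l\x00d\x00!"
print(list(iter_utf16be_strings(sample)))  # ['Hello', 'World!']
```

Because it is a generator, an error for a truncated record is raised only when iteration reaches that record, after the earlier strings have been yielded.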