Tags: python-3.x, unicode, newline, filehandle

Does reading a binary file linewise in python cause problems for unicode data?


I'm reading a large (10 GB) bzip2-compressed file in Python 3 that contains UTF-8-encoded JSON. I only want a few of the lines, namely those that start with a certain set of bytes. To avoid decoding every line into a str, I'm reading the file in 'rb' mode, like this:

import bz2

with bz2.open(filename, 'rb') as file:
    for line in file:
        if line.startswith(b'Hello'):
            text = line.decode('utf-8')  # decode only the matching lines
            # ... then do stuff with text

But I suddenly thought: what if the encoding of some Unicode character contains the same byte as a newline character? By doing for line in file, do I risk getting truncated lines? Or does the line-wise iterator over a binary file still work by magic?


Solution

  • Line-wise iteration will work for UTF-8 encoded data. Not by magic, but by design: UTF-8 was created to be backwards-compatible with ASCII.

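    ASCII only uses the byte values 0 through 127, leaving the upper half of the possible byte values free for extensions. UTF-8 takes advantage of this: any Unicode code point outside ASCII is encoded using only bytes in the range 128..255. A byte below 128 in a UTF-8 stream is therefore always the ASCII character itself, so the byte 0A can only be a real newline.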

    For example, the letter "Ċ" (capital letter C with dot above) has the Unicode codepoint value U+010A. In UTF-8, this is encoded with the byte sequence C4 8A, thus without using the byte 0A, which is the ASCII newline.

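    In contrast, UTF-16 encodes the same character as 0A 01 (little-endian) or 01 0A (big-endian). Line-wise iteration over UTF-16 data in binary mode is therefore not safe: a 0A byte can occur inside a character, and a real newline is itself two bytes (0A 00 or 00 0A). UTF-16 is not that common as a file encoding, though.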
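
The difference between the two encodings can be checked with a short script. This is only a sketch: the sample text and the helper name binary_lines are made up for illustration, and an in-memory io.BytesIO stands in for the bz2 file on the assumption that both split binary data on the single byte 0A in the same way.

```python
import io

# "Ċ" is U+010A: the code point value contains 0A, but its UTF-8 bytes do not.
char = "\u010a"
utf8 = char.encode("utf-8")          # C4 8A
utf16_le = char.encode("utf-16-le")  # 0A 01
utf16_be = char.encode("utf-16-be")  # 01 0A

# Illustrative sample: two lines, each containing the character.
text = "Hello \u010a\nWorld \u010a\n"


def binary_lines(data):
    """Split bytes the way `for line in file` does for a file opened in 'rb' mode."""
    return list(io.BytesIO(data))


utf8_lines = binary_lines(text.encode("utf-8"))
utf16_lines = binary_lines(text.encode("utf-16-le"))

print(utf8.hex(), utf16_le.hex(), utf16_be.hex())
print(len(utf8_lines), [line.decode("utf-8") for line in utf8_lines])
print(len(utf16_lines))  # more than 2: the data was cut inside characters
```

With this sample, the UTF-8 data comes back as the two original lines, each of which decodes cleanly. The UTF-16 data is cut into five fragments, because every 0A byte is treated as a line end, including the ones that belong to "Ċ".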