I'm reading a large (10 GB) bzipped file in Python 3, which is UTF-8-encoded JSON. I only want a few of the lines, though: those that start with a certain sequence of bytes. So, to save having to decode every line into Unicode, I'm reading the file in 'rb' mode, like this:
import bz2

with bz2.open(filename, 'rb') as file:
    for line in file:
        if line.startswith(b'Hello'):
            # decode line here, then do stuff
But I suddenly thought: what if one of the Unicode characters contains the same byte value as a newline character? By doing for line in file, will I risk getting truncated lines? Or does the line-wise iterator over a binary file still work by magic?
Line-wise iteration will work for UTF-8-encoded data. Not by magic, but by design: UTF-8 was created to be backward-compatible with ASCII.
ASCII only uses the byte values 0 through 127, leaving the upper half of possible byte values for extensions of any kind. UTF-8 takes advantage of this: any Unicode codepoint outside ASCII is encoded using only bytes in the range 128..255.
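That guarantee is easy to check from Python itself; a small sketch (the sample characters are arbitrary):

```python
# Every byte of a non-ASCII character's UTF-8 encoding is >= 0x80 (128),
# so a byte below 128, such as the newline 0x0A, can only ever be a
# genuine ASCII character in a UTF-8 byte stream.
for ch in ["é", "Ċ", "€", "😀"]:
    data = ch.encode("utf-8")
    assert all(byte >= 0x80 for byte in data), (ch, data)
    print(f"{ch!r} -> {data.hex(' ')}")
```

This is why splitting a UTF-8 file on b'\n' can never land in the middle of a multi-byte character.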
For example, the letter "Ċ" (capital letter C with dot above) has the Unicode codepoint value U+010A. In UTF-8, this is encoded with the byte sequence C4 8A, thus without using the byte 0A, which is the ASCII newline.
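You can confirm that byte sequence directly (a minimal sketch):

```python
# "Ċ" is U+010A, but its UTF-8 encoding is C4 8A: the newline byte
# 0x0A never appears, so splitting on b'\n' cannot cut through it.
ch = "\u010A"                        # "Ċ"
assert ch.encode("utf-8") == b"\xc4\x8a"
assert b"\n" not in ch.encode("utf-8")
```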
In contrast, UTF-16 encodes the same character as 0A 01 or 01 0A (depending on the endianness).
So UTF-16 is not safe for line-wise iteration. It's not that common as a file encoding, though.
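For contrast, a quick sketch showing that the same character under UTF-16 does contain a literal 0x0A byte, which is exactly what would break naive line splitting:

```python
# Under UTF-16, U+010A is stored as the bytes 0A 01 (little-endian)
# or 01 0A (big-endian), so a raw 0x0A byte appears mid-character.
ch = "\u010A"                        # "Ċ"
le = ch.encode("utf-16-le")
be = ch.encode("utf-16-be")
assert le == b"\x0a\x01" and be == b"\x01\x0a"

# Iterating a UTF-16 file line-wise in binary mode would therefore
# split this character in half.
assert b"\n" in le and b"\n" in be
```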