Tags: python, file-io, io, raw, raw-file

Achieving consistent block sizing in Python raw file IO


Question up front:

Is there a Pythonic way in the standard library to parse raw binary files using for ... in ... syntax (i.e., __iter__/__next__) that yields blocks respecting the buffering argument of open(), without having to subclass IOBase or its child classes?

Detailed explanation

I'd like to open a raw file for parsing, making use of the for ... in ... syntax, and I'd like that syntax to yield predictably shaped objects. This wasn't happening as expected for a problem I was working on, so I tried the following test (import numpy as np required):

In [271]: with open('tinytest.dat', 'wb') as f:
     ...:     f.write(np.random.randint(0, 256, 16384, dtype=np.uint8).tobytes())
     ...:

In [272]: np.array([len(b) for b in open('tinytest.dat', 'rb', 16)])
Out[272]:
array([  13,  138,  196,  263,  719,   98,  476,    3,  266,   63,   51,
    241,  472,   75,  120,  137,   14,  342,  148,  399,  366,  360,
     41,    9,  141,  282,    7,  159,  341,  355,  470,  427,  214,
     42, 1095,   84,  284,  366,  117,  187,  188,   54,  611,  246,
    743,  194,   11,   38,  196, 1368,    4,   21,  442,  169,   22,
    207,  226,  227,  193,  677,  174,  110,  273,   52,  357])

I could not understand why this seemingly random behavior was arising, or why it was not respecting the buffering argument. Using read1 gave the expected number of bytes:

In [273]: with open('tinytest.dat', 'rb', 16) as f:
     ...:     b = f.read1()
     ...:     print(len(b))
     ...:     print(b)
     ...:
16
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n\x0f8}'

And there it is: A newline near the end of the first block.

In [274]: with open('tinytest.dat', 'rb', 2048) as f:
     ...:     print(f.readline())
     ...:
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n'

Sure enough, readline was being called to produce each block of the file, and it was tripping up on the newline byte (value 10). I verified this by reading through the source; these lines appear in the definition of IOBase:

571    def __next__(self):
572        line = self.readline()
573        if not line:
574            raise StopIteration
575        return line

So my question is this: is there some more Pythonic way to achieve buffering-respecting raw file behavior that allows for ... in ... syntax, without having to subclass IOBase or its child classes (and thus stepping outside the standard library)? If not, does this unexpected behavior warrant a PEP? (Or does it just warrant learning to expect the behavior? :)


Solution

  • This behavior isn't unexpected; it is documented that all objects derived from IOBase iterate over lines. The only thing that changes between binary and text mode is how a line terminator is defined: in binary mode it is always b"\n" (a quick demo follows the quoted docs below).

    The docs:

    IOBase (and its subclasses) supports the iterator protocol, meaning that an IOBase object can be iterated over yielding the lines in a stream. Lines are defined slightly differently depending on whether the stream is a binary stream (yielding bytes), or a text stream (yielding character strings). See readline() below.
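
    As a quick illustration (a minimal demo using io.BytesIO as a stand-in for a real binary file), iterating over any binary stream splits on b"\n" wherever it happens to occur:

    >>> import io
    >>> list(io.BytesIO(b'ab\ncd\xff\nef'))
    [b'ab\n', b'cd\xff\n', b'ef']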

    The problem is that, historically, there was ambiguity between text and binary data in the type system; resolving it was a major motivating factor for the backwards-incompatible Python 2 -> 3 transition.

    I think it would certainly be reasonable to have the iterator protocol respect the buffer size for file objects opened in binary mode in Python 3. Why it was decided to keep the old behavior is something I can only speculate about.

    In any case, you should just define your own iterator; that is common in Python. Iterators are a basic building block, much like the built-in types.

    You can actually use the 2-argument iter(callable, sentinel) form to construct a super basic wrapper:

    >>> from functools import partial
    >>> def iter_blocks(f, n):
    ...     return iter(partial(f.read, n), b'')
    ...
    >>> np.array([len(b) for b in iter_blocks(open('tinytest.dat', 'rb'), 16)])
    array([16, 16, 16, ..., 16, 16, 16])
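
    The 2-argument form keeps calling f.read(n) until it returns the sentinel b'' at EOF; note that the final block may be shorter than n when the file size is not a multiple of the block size. A quick check (again with io.BytesIO standing in for a file):

    >>> import io
    >>> list(iter_blocks(io.BytesIO(b'0123456789'), 4))
    [b'0123', b'4567', b'89']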
    

    Of course, you could have just used a generator:

    def iter_blocks(bin_file, n):
        result = bin_file.read(n)
        while result:
            yield result
            result = bin_file.read(n)
    

    There are tons of ways to approach this. Again, iterators are a core building block of idiomatic Python.
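
    For instance, on Python 3.8+ the assignment expression makes the same generator a bit tighter; a minimal sketch of the same idea:

    def iter_blocks(bin_file, n):
        # := assigns and tests in one step; the loop ends when read() returns b'' at EOF
        while chunk := bin_file.read(n):
            yield chunk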

    Python is a pretty dynamic language, and "duck typing" is the name of the game. Generally, your first instinct shouldn't be "how do I subclass some built-in type to extend its functionality?" That is often possible, but the language offers a lot of features geared toward not having to do it, and the result is usually better expressed that way to begin with, at least to my eyes.
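
    To make that concrete, here is a sketch of a duck-typed wrapper (BlockReader is a hypothetical name, not anything in the standard library) that gives you for ... in ... without touching IOBase:

    class BlockReader:
        """Wrap anything with a read(n) method; iterate in fixed-size blocks."""
        def __init__(self, f, n):
            self.f = f
            self.n = n

        def __iter__(self):
            # same 2-argument iter trick as above; stops at the b'' sentinel
            return iter(lambda: self.f.read(self.n), b'')

    for block in BlockReader(open('tinytest.dat', 'rb'), 16):
        ...  # each block is 16 bytes, except possibly the last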