Search code examples
pythonfilepython-3.2

Is this a python 3 file bug?


Is this a bug? It demonstrates what happens when you use libtiff to extract an image from an open tiff file handle. It works in python 2.x and does not work in python 3.2.3

import os

# any file will work here, since it's not actually loading the tiff
# assuming it's big enough for the seek
filename = "/home/kostrom/git/wiredfool-pillow/Tests/images/multipage.tiff"

def test():
    fp1 = open(filename, "rb")
    buf1 = fp1.read(8)
    fp1.seek(28)
    fp1.read(2)
    for x in range(16):
        fp1.read(12)
    fp1.read(4)

    fd = os.dup(fp1.fileno())
    os.lseek(fd, 28, os.SEEK_SET)
    os.close(fd)

    # this magically fixes it: fp1.tell()
    fp1.seek(284)
    expect_284 = fp1.tell()
    print ("expected 284, actual %d" % expect_284)

test()

The output which I feel is in error is: expected 284, actual -504

Uncommenting the fp1.tell() does some ... side effect ... which stabilizes the py3 handle, and I don't know why. I'd also appreciate if someone can test other versions of python3.


Solution

  • os.dup creates a duplicate file descriptor that refers to the same open file description. Therefore, os.lseek(fd, 28, SEEK_SET) changes the seek position of the file underlying fp1.

    Python's file objects cache the file position to avoid repeated system calls. The side effect of this is that changing the file position without using the file object methods will desynchronize the cached position and the real position, leading to nonsense like you've observed.

    Worse yet, because the files are internally buffered by Python, seeking outside the file methods could actually cause the returned file data to be incorrect, leading to corruption or other nasty stuff.

    The documentation in bufferedio.c notes that tell can be used to reinitialize the cached value:

    * The absolute position of the raw stream is cached, if possible, in the
      `abs_pos` member. It must be updated every time an operation is done
      on the raw stream. If not sure, it can be reinitialized by calling
      _buffered_raw_tell(), which queries the raw stream (_buffered_raw_seek()
      also does it). To read it, use RAW_TELL().