Tags: python, file, file-io, io, filesystems

Among the many Python file copy functions, which ones are safe if the copy is interrupted?


As seen in How do I copy a file in Python?, there are many file copy functions:

  • shutil.copy

  • shutil.copy2

  • shutil.copyfile (and also shutil.copyfileobj)

  • or even a naive method:

    with open('sourcefile', 'rb') as f, open('destfile', 'wb') as g:
        while True:
            block = f.read(16*1024*1024)  # read in blocks of 16 MiB
            if not block:  # EOF
                break
            g.write(block)
    

Among all these methods, which ones are safe if the copy is interrupted (for example: the Python process is killed)? The last one in the list looks OK.

By safe I mean: if a 1 GB file copy is not 100% finished (say it is interrupted in the middle of the copy, after 400 MB), the file size should not be reported as 1 GB by the filesystem. Instead, the file should:

  • either report the size it had when the last bytes were written (e.g. 400 MB)
  • or be deleted

The worst case would be if the final file size were written first (internally with fallocate or ftruncate?). That would be a problem if the copy is interrupted: judging by the file size alone, we would think the file was written correctly.
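
For illustration, here is a minimal sketch of that worst case (an assumption about what a careless implementation could do, not something any shutil function is known to do): preallocating the final size with ftruncate before copying any data. The filesystem then reports the full size even though nothing has been written:

    import os

    # hypothetical worst case: reserve the final size before writing any data
    with open('destfile', 'wb') as g:
        os.ftruncate(g.fileno(), 1024**3)  # size is now reported as 1 GB

    print(os.path.getsize('destfile'))  # 1073741824, yet no data was copied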

Many incremental backup programs (I'm writing one) use "filename+mtime+fsize" to check whether a file has to be copied or is already there (a better solution, of course, is to SHA256 the source and destination files, but that is too time-consuming to do on every sync; off-topic here).
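
For context, here is a minimal sketch of such a check (a hypothetical helper, assuming a plain fsize+mtime comparison between source and destination):

    import os

    def needs_copy(src, dst):
        """Return True if dst is missing or differs from src in size/mtime."""
        if not os.path.exists(dst):
            return True
        s, d = os.stat(src), os.stat(dst)
        # a partial copy that already reports its final size would pass
        # this test by mistake -- hence the question
        return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)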

So I want to make sure that the "copy file" function does not store the final file size immediately, before copying the actual file content (that could fool the fsize comparison).
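
One way to verify this empirically (a sketch; POSIX-only since it uses SIGKILL, and it assumes a large 'sourcefile' exists so that the kill lands mid-copy):

    import os
    import signal
    import subprocess
    import time

    child = subprocess.Popen(
        ['python3', '-c',
         "import shutil; shutil.copyfile('sourcefile', 'destfile')"])
    time.sleep(0.05)                    # let the copy get partway through
    child.send_signal(signal.SIGKILL)   # hard interruption, no cleanup runs
    child.wait()
    print(os.path.getsize('destfile'))  # if safe: less than the source size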


Note: I'm asking the question because, while shutil.copyfile was rather straightforward in Python 3.7 and below, see source (it is more or less the naive method above), it is much more complicated in Python 3.9, see source, with many different code paths for Windows, Linux, macOS, "fastcopy" tricks, etc.


Solution

  • Assuming that destfile does not exist prior to the copy, the naive method is safe, per your definition of safe.

    shutil.copyfileobj() and shutil.copyfile() are a close second in the ranking.

    shutil.copy() is next, and shutil.copy2() would be last.

    Explanation:

    It is a filesystem's job to guarantee consistency based on application requests. If you are only writing X bytes to a file, the file size will only account for these X bytes.

    Therefore, doing direct FS operations like the naive method will work.
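
    A quick demonstration of that point (assuming a local filesystem):

        import os

        with open('demo.bin', 'wb') as g:
            g.write(b'x' * (400 * 1024 * 1024))  # write 400 MB, then "crash"

        print(os.path.getsize('demo.bin'))  # 419430400 -- only what was written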

    It is now a matter of what these higher-level functions do with the filesystem.

    The API doesn't state what happens if Python crashes mid-copy, but it is a de facto expectation that these functions behave like Unix cp, i.e. that they don't mess with the file size.

    Assuming that the maintainers of CPython don't want to break people's expectations, all these functions should be safe per your definition.

    That said, it isn't guaranteed anywhere, AFAICT.

    However, shutil.copyfileobj() and shutil.copyfile() expressly promise in their API not to copy metadata, so they are unlikely to try to set the size.
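
    For reference, both take care of the data only (usage sketch):

        import shutil

        # path-based, data only
        shutil.copyfile('sourcefile', 'destfile')

        # file-object-based, data only
        with open('sourcefile', 'rb') as f, open('destfile', 'wb') as g:
            shutil.copyfileobj(f, g)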

    shutil.copy() wouldn't try to set the file size, only the mode, and on most filesystems setting the size and setting the mode are two different filesystem operations, so it should still be safe.

    shutil.copy2() says it will copy metadata, and if you look at its source code, you'll see that it copies the metadata only after copying the data, so even that should be safe. Moreover, copying the metadata doesn't set the size.
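
    Simplified, per the CPython source (the real code also handles the dst-is-a-directory case, symlinks, etc.), copy2() boils down to:

        import shutil

        shutil.copyfile('sourcefile', 'destfile')  # 1. copy the data first
        shutil.copystat('sourcefile', 'destfile')  # 2. then mtime/atime/mode -- not the size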

    So this would only be a problem if some of the internal functions Python uses tried to optimize with ftruncate(), fallocate(), or similar, which is unlikely given that people who write system APIs (like the Python maintainers) are very aware of people's expectations.
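
    As a closing practical note, here is a variant sketch (not something shutil provides) that also covers the "or be deleted" option from the question: the naive copy wrapped so that an exception, e.g. a KeyboardInterrupt, removes the partial destination. It cannot help against SIGKILL, where no cleanup code runs at all.

        import os

        def naive_copy(src, dst):
            try:
                with open(src, 'rb') as f, open(dst, 'wb') as g:
                    while True:
                        block = f.read(16 * 1024 * 1024)  # 16 MiB blocks
                        if not block:  # EOF
                            break
                        g.write(block)
            except BaseException:
                os.remove(dst)  # drop the partial copy, then re-raise
                raise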