How to copy a large portion of a raw filesystem to a file?


I'm working with an arcane data collection filesystem. It's got a block describing the files and their exact offsets on disk, so I know each file's start byte, end byte and length in bytes. The goal is to grab one file from the physical disk. They're big files, so performance is paramount.

Here's what "works," but very inefficiently:

import shutil, io
def start_copy(startpos, endpos, filename="C:\\out.bin"):
    with open(r"\\.\PhysicalDrive1", 'rb') as src_f:
        src_f.seek(startpos)
        flength = endpos - startpos
        print("Starting copy of "+filename+" ("+str(flength)+"B)")
        with open(filename, 'wb') as dst_f:
            shutil.copyfileobj( io.BytesIO(src_f.read(flength)), dst_f )
        print("Finished copy of "+filename)

This is slow: io.BytesIO(src_f.read(flength)) technically works, but it reads the entire file into memory before writing to the destination file. So it takes much longer than it should.

Copying directly with shutil.copyfileobj(src_f, dst_f) won't work. (I assume) the end position can't be specified, so the copy doesn't stop.

Here are some questions, each of which could be a solution to this:

  • Is there a copy library (or external utility for Windows 7 that would work with subprocess) that takes start/end byte arguments?
  • Is it possible to create a file-like object that copyfileobj can use, which references just a portion of another file-like object?
  • Can an exception be raised when an io object seeks past a certain end point?
  • Can copyfileobj be forced to naturally stop at a given byte offset of the drive (a sort of "fake EOF")?

Solution

  • The obvious way to do this is to just read from the source and write to the destination file yourself, in chunks.

    The whole point of copyfileobj is that it buffers the data for you. If you have to read the whole file into a BytesIO, you're just buffering the BytesIO, which is pointless.

    So, just loop around reading a decent-sized buffer from src_f and write it to dst_f until you reach flength bytes.

    If you look at the shutil source (which is linked from the shutil docs), there's no magic inside copyfileobj; it's a trivial function. As of 3.6 (and I think it's been completely unchanged since shutil was added somewhere around 2.1…), it looks like this:

    def copyfileobj(fsrc, fdst, length=16*1024):
        """copy data from file-like object fsrc to file-like object fdst"""
        while 1:
            buf = fsrc.read(length)
            if not buf:
                break
            fdst.write(buf)
    

    You can do the same thing, just keeping track of bytes read and stopping at flength:

    def copypartialfileobj(fsrc, fdst, size, length=16*1024):
        """copy size bytes from file-like object fsrc to file-like object fdst"""
        written = 0
        while written < size:
            buf = fsrc.read(min(length, size - written))
            if not buf:
                break
            fdst.write(buf)
            written += len(buf)
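
To show how this could slot into the question's start_copy, here is a minimal sketch. The copy_range helper, its parameter names, and the 1 MiB default chunk size are my own assumptions, not anything from shutil, and I've added a return value to copypartialfileobj so the caller can tell whether the source ran out early. It treats endpos as exclusive, matching flength = endpos - startpos in the question. The demo uses an ordinary temp file in place of the drive; on a raw device such as \\.\PhysicalDrive1, reads may need sector-aligned offsets and sizes, which this sketch does not handle.

```python
import os
import tempfile


def copypartialfileobj(fsrc, fdst, size, length=16 * 1024):
    """Copy at most size bytes from fsrc to fdst; return the bytes copied."""
    written = 0
    while written < size:
        # Never ask for more than what is left of the requested range.
        buf = fsrc.read(min(length, size - written))
        if not buf:  # source ended before size bytes were available
            break
        fdst.write(buf)
        written += len(buf)
    return written


def copy_range(src_path, dst_path, startpos, endpos, length=1024 * 1024):
    """Hypothetical helper: copy bytes [startpos, endpos) of src_path to dst_path."""
    with open(src_path, "rb") as src_f, open(dst_path, "wb") as dst_f:
        src_f.seek(startpos)
        return copypartialfileobj(src_f, dst_f, endpos - startpos, length)


if __name__ == "__main__":
    # Demo on a regular file standing in for the raw device.
    tmpdir = tempfile.mkdtemp()
    src = os.path.join(tmpdir, "src.bin")
    dst = os.path.join(tmpdir, "dst.bin")
    with open(src, "wb") as f:
        f.write(bytes(range(256)) * 1000)  # 256000 bytes of test data
    copied = copy_range(src, dst, 1000, 200000)
    print("copied", copied, "bytes; output size is", os.path.getsize(dst))
```

Only one chunk is held in memory at a time, so peak memory is bounded by length rather than by the size of the file being extracted. A larger length than copyfileobj's 16 KiB default may help on big sequential reads, but that is worth measuring rather than assuming.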