
How to concat files in Python where total size > available memory


I'm writing a Python backend for Resumable.js, which allows uploading large files from a browser by splitting them into smaller chunks on the client.

Once the server has finished saving all chunks into a temporary folder, it needs to combine them into a single file. The individual chunks are small (1 MB by default) binary files, but their total size may exceed the web server's available memory.

How would you do the combining step in Python? Say a folder contains only n files, named "1", "2", "3"...

Can you explain how:

  • read()
  • write(.., 'wb')
  • write(.., 'ab')
  • shutil.copyfileobj()
  • mmap

would work in this case and what would be the recommended solution, based on these memory requirements?


Solution

  • Sticking to a purely Pythonic solution (I assume you have your reasons for not going with 'cat' on Linux or 'copy' on Windows):

    import shutil

    with open('out_bin', 'wb') as wfd:
        for f in filepaths:
            with open(f, 'rb') as fd:
                # Copy in 1 MB pieces so memory use stays constant
                # no matter how large the combined file grows.
                shutil.copyfileobj(fd, wfd, 1024 * 1024)
    

    will get the job done reliably and efficiently.
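
    For completeness, here is one way filepaths might be built for that loop; the folder name chunks and the purely numeric filenames are assumptions taken from the question:

        import os

        chunk_dir = 'chunks'  # assumed temporary folder holding chunks "1", "2", ...
        # Sort numerically, not lexicographically, so "10" comes after "9".
        filepaths = sorted(
            (os.path.join(chunk_dir, name) for name in os.listdir(chunk_dir)),
            key=lambda p: int(os.path.basename(p)),
        )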

    The key point is reading and writing in binary mode ('rb', 'wb'). That avoids the implicit newline conversions of text mode, which would otherwise corrupt the combined binary output.
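
    A minimal sketch of why that matters: forcing a newline translation in text mode changes the bytes that reach the disk (demo.txt is a made-up filename):

        with open('demo.txt', 'w', newline='\r\n') as f:
            f.write('a\nb')    # text mode translates the '\n'

        with open('demo.txt', 'rb') as f:
            print(f.read())    # b'a\r\nb' -- one byte became two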

    If you are after the fastest approach, you would need to benchmark against the other methods you listed, and I see no guarantee that the winner of such a benchmark would not be somewhat OS-dependent.
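
    If you do benchmark, the read()/'ab' alternative from the question would look roughly like this. This is only a sketch, reusing filepaths from above; it assumes out_bin does not already exist, since append mode never truncates:

        CHUNK = 1024 * 1024  # 1 MB per read keeps memory use bounded

        for f in filepaths:
            with open(f, 'rb') as fd, open('out_bin', 'ab') as wfd:
                while True:
                    buf = fd.read(CHUNK)
                    if not buf:      # empty bytes -> end of this chunk file
                        break
                    wfd.write(buf)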