
Joining big files in Python


I have several HEVC files that I'd like to merge. With small files (about 1.5 GB) the following code works fine:

with open(path+"/"+str(sys.argv[2])+"_EL.265", "wb") as outfile:
        for fname in dirs:
                with open(path+"/"+fname, 'rb') as infile:
                    outfile.write(infile.read())

With bigger files (8 GB or more) the same code gets stuck. I've copied the code to read big files in chunks from here (Lazy Method for Reading Big File in Python?) and integrated it into my code:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open(path + "/" + str(sys.argv[2]) + "_BL.265", "wb") as outfile_bl:
        for fname in dirs:
                    with open(path+"/"+fname, 'rb') as infile:
                            for piece in read_in_chunks(infile):
                                outfile_bl.write(infile.read())

This code produces a file of the right size, but it is no longer a valid HEVC file and cannot be played by a video player.

Any idea? Please help.

Dario


Solution

  • You are reading from infile in two different places: inside read_in_chunks, and again directly in the outfile_bl.write(infile.read()) call. For each file, the generator reads one chunk into piece, and then infile.read() consumes everything from that point to the end of the file, so the first 1024 bytes of every input file are never written. The result has almost the right size, but the missing bytes at the start of each file corrupt the HEVC stream.

    You've already read data into piece; just write that to your file.

    with open(path + "/" + str(sys.argv[2]) + "_BL.265", "wb") as outfile_bl:
        for fname in dirs:
            with open(path+"/"+fname, 'rb') as infile:
                for piece in read_in_chunks(infile):
                    outfile_bl.write(piece)
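
    One practical note: with 8 GB inputs, a 1 KiB chunk means roughly eight million read/write calls per file. Passing a larger chunk size, for example read_in_chunks(infile, 1024 * 1024) (1 MiB here is just an illustrative value, not a requirement), still keeps memory use bounded while cutting the loop overhead considerably.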
    

    As an aside, you don't really need to define read_in_chunks, or at least its definition can be simplified greatly by using the two-argument form of iter, which calls a function repeatedly until it returns a sentinel value:

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""

        # The sentinel must be b'' (not ''): the files are opened in
        # binary mode, so read() returns b'' at end of file.
        yield from iter(lambda: file_object.read(chunk_size), b'')

        # Or
        # from functools import partial
        # yield from iter(partial(file_object.read, chunk_size), b'')
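
    Finally, since copying one file object into another in chunks is such a common task, the standard library already covers it: shutil.copyfileobj performs exactly this chunked loop for you. A minimal sketch of the whole merge using it, reusing the path, dirs, and sys.argv variables from the question:

    import shutil
    import sys

    with open(path + "/" + str(sys.argv[2]) + "_BL.265", "wb") as outfile_bl:
        for fname in dirs:
            with open(path + "/" + fname, 'rb') as infile:
                # Copy infile into outfile_bl in fixed-size chunks,
                # never holding the whole file in memory.
                shutil.copyfileobj(infile, outfile_bl)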