python mmap large-files

Removing a large multi-line string from a very large file


I have a 10 GB text file, from which I want to find and delete a multi-line chunk. This chunk is given as another 10 MB text file, constituting a contiguous section that appears once in the large file and spans complete lines. Assuming I do not have enough memory to load the whole 10 GB file, what would be the easiest way to do this in a scripting language?

Example:

big.txt:

...

I have a 10 GB text file, from which I want to find and delete a multi-line chunk.

This chunk is given as another 10 MB text file,

constituting a contentious section appearing once in the large file and spanning complete lines.

Assuming I do not have enough memory to process the whole 10 GB in memory,

what would be the easiest way to do so in some scripting language?

...

chunk.txt:

This chunk is given as another 10 MB text file,

constituting a contentious section appearing once in the large file and spanning complete lines.

result.txt:

...

I have a 10 GB text file, from which I want to find and delete a multi-line chunk.

Assuming I do not have enough memory to process the whole 10 GB in memory,

what would be the easiest way to do so in some scripting language?

...

Solution

  • Following this comment, I implemented a Python script that solves my issue using mmap. It also works under more general conditions:

    • does not require complete lines
    • deals with multiple non-overlapping matches
    • deals with multiple chunk files, in order of decreasing file size
    • works with bytes
    • chunks can be very large themselves
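The answer does not say why chunk files are handled in order of decreasing size. One likely reason is that a small chunk may also occur inside a larger one, so deleting the small chunk first would destroy the larger match. The sketch below illustrates this with in-memory `bytes.replace` as a stand-in for the mmap-based deletion; `delete_all` and the sample data are hypothetical and exist only for this illustration.

```python
def delete_all(data, chunks):
    """Delete every occurrence of each chunk from data, in the order given.

    Hypothetical in-memory stand-in for the mmap-based deletion.
    """
    for chunk in chunks:
        data = data.replace(chunk, b'')
    return data


data = b'keep1\nAAA\nBBB\nkeep2\nBBB\nkeep3\n'
small = b'BBB\n'
large = b'AAA\nBBB\n'  # contains the small chunk

largest_first = delete_all(data, sorted([small, large], key=len, reverse=True))
smallest_first = delete_all(data, sorted([small, large], key=len))

print(largest_first)   # b'keep1\nkeep2\nkeep3\n'
print(smallest_first)  # b'keep1\nAAA\nkeep2\nkeep3\n'
```

With the smallest chunk first, the `AAA` line is left behind because the larger chunk can no longer be found.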

    Code:

    """Usage: python3 delchunk.py BIGFILE CHUNK_FILE_OR_FOLDER [OUTFILE]
    Given a large file BIGFILE, delete all complete, non-overlapping occurrences of the (possibly large) chunks given by CHUNK_FILE_OR_FOLDER.
    Multiple chunks will be deleted from the largest to the smallest.
    If OUTFILE is not given, the result will be saved to BIGFILE.delchunk
    """
    
    
    import mmap
    import os
    import shutil
    import sys
    
    
    if len(sys.argv) < 3:
        print(__doc__)
        sys.exit(1)
    output = sys.argv[3] if len(sys.argv) > 3 else sys.argv[1] + '.delchunk'
    if sys.argv[1] != output:
        shutil.copy(sys.argv[1], output)
    if os.path.isdir(sys.argv[2]):
        # Regular files in the folder, largest first
        paths = [os.path.join(sys.argv[2], name) for name in os.listdir(sys.argv[2])]
        chunks = sorted(filter(os.path.isfile, paths), key=os.path.getsize, reverse=True)
    else:
        chunks = [sys.argv[2]]
    # Map the output file read-write so matches can be cut out in place
    with open(output, 'r+b') as bigfile, mmap.mmap(bigfile.fileno(), 0) as bigmap:
        for chunk in chunks:
            with open(chunk, 'rb') as chunkfile, mmap.mmap(chunkfile.fileno(), 0, access=mmap.ACCESS_READ) as chunkmap:
                i = 0  # number of occurrences deleted for this chunk
                while True:
                    # Find the last remaining occurrence of the chunk
                    start = bigmap.rfind(chunkmap)
                    if start == -1:
                        break
                    i += 1
                    end = start + len(chunkmap)
                    print('Deleting chunk %s (%d) at %d:%d' % (chunk, i, start, end))
                    # Shift everything after the match down over it, then shrink the map and the file
                    bigmap.move(start, end, len(bigmap) - end)
                    bigmap.resize(len(bigmap) - len(chunkmap))
                if not i:
                    print('Chunk %s not found' % chunk)
                else:
                    bigmap.flush()
    

    https://gist.github.com/eyaler/971efea29648af023e21902b9fa56f08
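The core move-then-shrink step can be tried on a small temporary file. The following is a minimal sketch, not the script above: `delete_last_occurrence` is a hypothetical helper, and it shrinks the file with `file.truncate()` after closing the map instead of calling `mmap.resize()`, on the assumption that `resize()` may not be available on every platform.

```python
import mmap
import os
import tempfile


def delete_last_occurrence(path, chunk):
    """Remove the last occurrence of chunk (bytes) from the file at path.

    Returns True if something was deleted. Hypothetical helper for illustration.
    """
    size = os.path.getsize(path)
    if not chunk or size == 0:
        return False  # nothing to search for, or mmap cannot map an empty file
    with open(path, 'r+b') as f:
        with mmap.mmap(f.fileno(), 0) as m:
            start = m.rfind(chunk)
            if start == -1:
                return False
            end = start + len(chunk)
            # Slide the tail (everything after the match) down over the match
            m.move(start, end, size - end)
            m.flush()
        # The map is closed here; cut off the leftover bytes at the end
        f.truncate(size - len(chunk))
    return True


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        big = os.path.join(tmp, 'big.txt')
        with open(big, 'wb') as f:
            f.write(b'line 1\nline 2\nline 3\nline 4\n')
        delete_last_occurrence(big, b'line 2\nline 3\n')
        with open(big, 'rb') as f:
            print(f.read())  # b'line 1\nline 4\n'
```

Only the part of the file after the match is rewritten, and the operating system pages data in and out as needed, so the 10 GB file never has to fit in memory at once.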