I have a 10 GB text file from which I want to find and delete a multi-line chunk. The chunk is given as another 10 MB text file; it forms a contiguous section that appears once in the large file and spans complete lines. Assuming I do not have enough memory to load the whole 10 GB, what would be the easiest way to do this in a scripting language?
Example:
big.txt:
...
I have a 10 GB text file, from which I want to find and delete a multi-line chunk.
This chunk is given as another 10 MB text file,
constituting a contiguous section appearing once in the large file and spanning complete lines.
Assuming I do not have enough memory to process the whole 10 GB in memory,
what would be the easiest way to do so in some scripting language?
...
chunk.txt:
This chunk is given as another 10 MB text file,
constituting a contiguous section appearing once in the large file and spanning complete lines.
result.txt:
...
I have a 10 GB text file, from which I want to find and delete a multi-line chunk.
Assuming I do not have enough memory to process the whole 10 GB in memory,
what would be the easiest way to do so in some scripting language?
...
Following this comment, I implemented a Python script that solves my issue using mmap. It also works under more general conditions (several chunk files, and chunks that occur more than once):
Code:
"""Usage: python3 delchunk.py BIGFILE CHUNK_FILE_OR_FOLDER [OUTFILE]
Given a large file BIGFILE, delete all complete non-overlapping possibly large chunks given by CHUNK_FILE_OR_FOLDER
Multiple chunks will be deleted from the largest to the smallest
If OUTFILE is not given, result will be saved to BIGFILE.delchunk
"""
import mmap
import os
import shutil
import sys

if len(sys.argv) < 3:
    print(__doc__)
    sys.exit(1)

output = sys.argv[3] if len(sys.argv) > 3 else sys.argv[1] + '.delchunk'
# Work on a copy unless OUTFILE is BIGFILE itself (in-place edit)
if sys.argv[1] != output:
    shutil.copy(sys.argv[1], output)

if os.path.isdir(sys.argv[2]):
    # Every regular file in the folder is a chunk; process the largest first
    paths = [os.path.join(sys.argv[2], name) for name in os.listdir(sys.argv[2])]
    chunks = sorted(filter(os.path.isfile, paths), key=os.path.getsize, reverse=True)
else:
    chunks = [sys.argv[2]]

# Map both files so the OS pages data in on demand instead of loading it all
with open(output, 'r+b') as bigfile, mmap.mmap(bigfile.fileno(), 0) as bigmap:
    for chunk in chunks:
        with open(chunk, 'rb') as chunkfile, mmap.mmap(chunkfile.fileno(), 0, access=mmap.ACCESS_READ) as chunkmap:
            i = 0
            while True:
                # Search from the end so each deletion shifts as little data as possible
                start = bigmap.rfind(chunkmap)
                if start == -1:
                    break
                i += 1
                end = start + len(chunkmap)
                print('Deleting chunk %s (%d) at %d:%d' % (chunk, i, start, end))
                # Slide everything after the chunk down over it, then shrink the file
                bigmap.move(start, end, len(bigmap) - end)
                bigmap.resize(len(bigmap) - len(chunkmap))
            if not i:
                print('Chunk %s not found' % chunk)
            else:
                bigmap.flush()
https://gist.github.com/eyaler/971efea29648af023e21902b9fa56f08
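For the original single-chunk case, the core step can be sketched on its own. This is a minimal sketch under stated assumptions, not the script above: delete_chunk_once is a hypothetical helper name, it assumes the chunk is passed as bytes and ends with a newline, it only checks that the match starts on a line boundary, and it truncates the file after closing the map rather than calling mmap.resize, which is not available on every platform.

```python
import mmap
import os


def delete_chunk_once(path, chunk):
    """Remove the first line-aligned occurrence of `chunk` (bytes) from `path`.

    Hypothetical helper for illustration. Assumes `chunk` ends with b'\n'.
    Returns True if a match was removed, False otherwise.
    """
    with open(path, 'r+b') as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0 or not chunk:
            return False  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0) as m:
            start = m.find(chunk)
            # Skip matches that do not begin at the start of a line
            while start > 0 and m[start - 1] != ord('\n'):
                start = m.find(chunk, start + 1)
            if start == -1:
                return False
            end = start + len(chunk)
            # Shift the tail of the file left over the matched span
            m.move(start, end, size - end)
            m.flush()
        # Shrink the file only after the map has been closed
        f.truncate(size - len(chunk))
    return True
```

Calling delete_chunk_once('big.txt', open('chunk.txt', 'rb').read()) edits big.txt in place, so run it on a copy if the original must be kept. Reading a 10 MB chunk into memory is cheap here; only the 10 GB file needs to be mapped.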