What would be the most Pythonic way of implementing this awk command in Python?
awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==500){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}' mbox
I'm using this now to split up enormous mailbox (mbox format) files.
I'm trying a recursive method right now.
def chunkUp(mbox, chunk=0):
    with open(mbox, 'r') as bigfile:
        msg = 0
        for line in bigfile:
            if msg == 0:
                with open("./TestChunks/chunks/chunk_"+str(chunk)+".txt", "a+") as cf:
                    if line.startswith("From "): msg += 1
                    cf.write(line)
                    if msg > 20: chunkUp(mbox, chunk+1)
I would love to be able to implement this in Python and resume progress if it is interrupted. Working on that bit now.
I'm tying my brain into knots! Cheers!
Your recursive approach is doomed to fail: you may end up with too many files open at once, since each with block stays open until the recursive call made inside it returns. Each recursive call also reopens the mbox and reads it again from the first line.
Better to keep one handle open and write to it, then close it and open a new handle when a line starting with "From " is encountered.
Also, open your files in write mode, not append. The code below does the minimum of operations and tests to write each line to a file, and closes the current file and opens another when a "From " line is found. At the end, the last file is closed.
def chunkUp(mbox):
    with open(mbox, 'r') as bigfile:
        handle = None
        chunk = 0
        for line in bigfile:
            if line.startswith("From "):
                # next (or first) file
                chunk += 1
                if handle is not None:
                    handle.close()
                    handle = None
            # file was closed / first file: create a new one
            if handle is None:
                handle = open("./TestChunks/chunks/chunk_{}.txt".format(chunk), "w")
            # write the line in the current file
            handle.write(line)
        if handle is not None:
            handle.close()
I haven't tested it, but it's simple enough that it should work. If the file doesn't start with a "From " line, all lines before the first one are stored in chunk_0.txt.
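Note that the function above starts a new file at every "From " line, whereas the awk command groups messages into chunks of roughly 500. A minimal sketch of the grouped variant follows, using the same single-handle idea. The name split_mbox and the parameters out_dir and msgs_per_chunk are made up for illustration, and opening the files in binary mode is an assumption on my part (to avoid decoding errors on mail with mixed encodings). It puts exactly msgs_per_chunk messages in each chunk, which differs slightly from the awk counter (there the first chunk ends up with 499 messages). Resuming after an interruption is not handled here.

```python
import os


def split_mbox(mbox_path, out_dir, msgs_per_chunk=500):
    """Split mbox_path into out_dir/chunk_N.txt files of msgs_per_chunk messages.

    Returns the number of chunk files written.
    """
    if msgs_per_chunk < 1:
        raise ValueError("msgs_per_chunk must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    chunk = 0      # index of the chunk file currently being written
    msgs = 0       # messages started in the current chunk
    handle = None  # the one output handle that is open at any time
    with open(mbox_path, "rb") as bigfile:
        try:
            for line in bigfile:
                if line.startswith(b"From "):
                    if msgs == msgs_per_chunk:
                        # current chunk is full: this message starts the next one
                        handle.close()
                        handle = None
                        chunk += 1
                        msgs = 0
                    msgs += 1
                if handle is None:
                    # chunk was just closed, or this is the very first line
                    name = "chunk_{}.txt".format(chunk)
                    handle = open(os.path.join(out_dir, name), "wb")
                handle.write(line)
        finally:
            if handle is not None:
                handle.close()
    # handle stays None only if the input had no lines at all
    return chunk + 1 if handle is not None else 0
```

With the paths from the question, the call would look like split_mbox("mbox", "./TestChunks/chunks"). As with the version above, any lines before the first "From " line end up in chunk_0.txt.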