I'm in my 2nd week of Python and I'm stuck on a directory of zipped/unzipped logfiles, which I need to parse and process.
Currently I'm doing this:
import os
import sys
import operator
import zipfile
import zlib
import gzip
import subprocess
if sys.version.startswith("3."):
import io
io_method = io.BytesIO
else:
import cStringIO
io_method = cStringIO.StringIO
for f in glob.glob('logs/*'):
file = open(f,'rb')
new_file_name = f + "_unzipped"
last_pos = file.tell()
# test for gzip
if (file.read(2) == b'\x1f\x8b'):
file.seek(last_pos)
#unzip to new file
out = open( new_file_name, "wb" )
process = subprocess.Popen(["zcat", f], stdout = subprocess.PIPE, stderr=subprocess.STDOUT)
while True:
if process.poll() != None:
break;
output = io_method(process.communicate()[0])
exitCode = process.returncode
if (exitCode == 0):
print "done"
out.write( output )
out.close()
else:
raise ProcessException(command, exitCode, output)
which I've "stitched" together using these SO answers (here) and blogposts (here)
However, it does not seem to work, because my test file is 2.5GB and the script has been chewing on it for 10+mins plus I'm not really sure if what I'm doing is correct anyway.
Question:
If I don't want to use GZIP module and need to de-compress chunk-by-chunk (actual files are >10GB), how do I uncompress and save to file using zcat and subprocess in Python?
Thanks!
This should read the first line of every file in the logs subdirectory, unzipping as required:
#!/usr/bin/env python
import glob
import gzip
import subprocess
for f in glob.glob('logs/*'):
if f.endswith('.gz'):
# Open a compressed file. Here is the easy way:
# file = gzip.open(f, 'rb')
# Or, here is the hard way:
proc = subprocess.Popen(['zcat', f], stdout=subprocess.PIPE)
file = proc.stdout
else:
# Otherwise, it must be a regular file
file = open(f, 'rb')
# Process file, for example:
print f, file.readline()