python, logging, gzip, compression, zcat

How to test a directory of files for gzip and uncompress gzipped files in Python using zcat?


I'm in my 2nd week of Python and I'm stuck on a directory of log files, some gzipped and some not, which I need to parse and process.

Currently I'm doing this:

import os
import sys
import operator
import zipfile
import zlib
import gzip
import subprocess

if sys.version.startswith("3."):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

for f in glob.glob('logs/*'):
    file = open(f,'rb')        
    new_file_name = f + "_unzipped"
    last_pos = file.tell()

    # test for gzip
    if (file.read(2) == b'\x1f\x8b'):
        file.seek(last_pos)

    #unzip to new file
    out = open( new_file_name, "wb" )
    process = subprocess.Popen(["zcat", f], stdout = subprocess.PIPE, stderr=subprocess.STDOUT)

    while True:
      if process.poll() != None:
        break;

    output = io_method(process.communicate()[0])
    exitCode = process.returncode


    if (exitCode == 0):
      print "done"
      out.write( output )
      out.close()
    else:
      raise ProcessException(command, exitCode, output)

I "stitched" this together from a few SO answers and blog posts.

However, it doesn't seem to work: my test file is 2.5 GB and the script has been chewing on it for 10+ minutes. I'm also not sure whether what I'm doing is correct in the first place.

Question:
If I don't want to use the gzip module and need to decompress chunk by chunk (the actual files are >10 GB), how do I uncompress and save to a file using zcat and subprocess in Python?

Thanks!


Solution

  • This should read the first line of every file in the logs subdirectory, unzipping as required:

    #!/usr/bin/env python
    
    import glob
    import gzip
    import subprocess
    
    for f in glob.glob('logs/*'):
      if f.endswith('.gz'):
        # Open a compressed file. Here is the easy way:
        #   file = gzip.open(f, 'rb')
        # Or, here is the hard way:
        proc = subprocess.Popen(['zcat', f], stdout=subprocess.PIPE)
        file = proc.stdout
      else:
        # Otherwise, it must be a regular file
        file = open(f, 'rb')
    
      # Process file, for example:
      print f, file.readline()
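
If you also want to save the uncompressed data to a new file chunk by chunk, as the question asks, something along these lines may work. It is only a sketch, assuming Python 3 and a `zcat` binary on your PATH. The helper names (`is_gzipped`, `decompress_to_file`, `unzip_logs`) and the 1 MiB chunk size are my own choices, not anything standard. It tests for gzip with the magic bytes from the question rather than the `.gz` extension. The loop in the question most likely hangs because it polls the process without ever reading from the pipe: once the pipe buffer fills up, zcat blocks and never exits. Reading the pipe as the data arrives avoids that.

```python
import glob
import shutil
import subprocess

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def decompress_to_file(src, dst, chunk_size=1024 * 1024):
    """Stream the output of `zcat src` into dst, chunk_size bytes at a time."""
    cmd = ["zcat", src]  # assumption: zcat is installed and on PATH
    with open(dst, "wb") as out, \
            subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        # copyfileobj reads and writes in chunks, so the pipe is drained
        # continuously and the whole file is never held in memory.
        shutil.copyfileobj(proc.stdout, out, chunk_size)
    # Leaving the Popen context closes the pipe and waits for zcat to exit.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


def unzip_logs(pattern="logs/*"):
    """Decompress every gzipped file matching pattern to <name>_unzipped."""
    for path in glob.glob(pattern):
        if is_gzipped(path):
            decompress_to_file(path, path + "_unzipped")
```

Calling `unzip_logs()` from the directory that contains `logs/` should write a `<name>_unzipped` file next to each gzipped log and leave plain files alone. If you don't need to touch the data in Python at all, passing the open output file as `stdout=` to `subprocess.check_call` lets zcat write to it directly, which is simpler still.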