python subprocess with gzip


I am trying to stream data through a subprocess, gzip it, and write it to a file. The following works, but I wonder whether it is possible to use Python's native gzip library instead.

fid = gzip.open(self.ipFile, 'rb') # input data
oFid = open(filtSortFile, 'wb') # output file
sort = subprocess.Popen(args="sort | gzip -c ", shell=True, stdin=subprocess.PIPE, stdout=oFid) # set up the pipe
processlines(fid, sort.stdin, filtFid) # pump data into the pipe

THE QUESTION: How do I do this with Python's gzip package instead? I'm mostly curious why the following gives me a text file (instead of a compressed binary one), which seems very odd.

fid = gzip.open(self.ipFile, 'rb')
oFid = gzip.open(filtSortFile, 'wb')
sort = subprocess.Popen(args="sort ", shell=True, stdin=subprocess.PIPE, stdout=oFid)
processlines(fid, sort.stdin, filtFid)

Solution

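  • subprocess writes directly to the file descriptor returned by oFid.fileno(), but gzip's fileno() returns the descriptor of the underlying, uncompressed file object, so the child's output never passes through the compressor: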

    def fileno(self):
        """Invoke the underlying file object's fileno() method."""
        return self.fileobj.fileno()
    

    To enable compression use gzip methods directly:

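    import gzip
    from subprocess import Popen, PIPE
    from threading import Thread
    
    def f(input, output):
        # Copy sort's output into the gzip file; closing it writes the gzip trailer.
        with output:
            # The pipe yields bytes, so the end-of-file sentinel must be b''.
            for line in iter(input.readline, b''):
                output.write(line)
    
    p = Popen(["sort"], bufsize=-1, stdin=PIPE, stdout=PIPE)
    t = Thread(target=f, args=(p.stdout, gzip.open('out.gz', 'wb')))
    t.start()
    
    for s in "cafebabe":
        p.stdin.write((s + "\n").encode())
    p.stdin.close()  # signals end of input so sort can emit its output
    t.join()
    p.wait()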
    

    Example

    $ python gzip_subprocess.py  && od -c out.gz && zcat out.gz 
    0000000 037 213  \b  \b 251   E   t   N 002 377   o   u   t  \0   K 344
    0000020   J 344   J 002 302   d 256   T       L 343 002  \0   j 017   j
    0000040   k 020  \0  \0  \0
    0000045
    a
    a
    b
    b
    c
    e
    e
    f
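
To see why the second attempt in the question produces plain text, the descriptor behaviour described above can be reproduced without any subprocess. The following is only a minimal sketch: the helper names write_past_gzip and write_through_gzip are hypothetical, and os.write is used as a stand-in for what a child process does with an inherited stdout descriptor.

```python
import gzip
import os
import tempfile


def write_past_gzip(path, payload):
    """Write payload to the descriptor a GzipFile reports (hypothetical helper)."""
    with gzip.open(path, "wb") as gz:
        # fileno() is forwarded to the wrapped file object.
        same_fd = gz.fileno() == gz.fileobj.fileno()
        # Assumption: os.write stands in for a child process that was
        # started with stdout=gz. The bytes go straight to the file and
        # never reach the compressor.
        os.write(gz.fileno(), payload)
    with open(path, "rb") as f:
        return same_fd, f.read()


def write_through_gzip(path, payload):
    """Write payload via GzipFile.write so that it is actually compressed."""
    with gzip.open(path, "wb") as gz:
        gz.write(payload)
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    data = b"plain text line\n" * 4

    same_fd, raw = write_past_gzip(os.path.join(tmp, "past.gz"), data)
    print("same descriptor:", same_fd)
    print("uncompressed text present in file:", data in raw)

    compressed = write_through_gzip(os.path.join(tmp, "through.gz"), data)
    print("starts with gzip magic bytes:", compressed[:2] == b"\x1f\x8b")
```

In the first helper the text lands in the file verbatim, which matches the "text file" observed in the question. Only data passed to the gzip object's own write() method is compressed, which is why the solution reads from the pipe in Python and writes through gzip.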