Tags: python, streaming, zlib, gunzip, zcat

How to stream a gzip built on the fly in Python?


I'd like to stream a big log file over the network using asyncio. I retrieve the data from the database, format it, compress it using Python's zlib module, and stream it over the network.

Here is basically the code I use:

import asyncio
from zlib import compressobj

@asyncio.coroutine
def logs(request):
    # ...

    yield from resp.prepare(request)

    # gzip magic number and compression format
    resp.write(b'\x1f\x8b\x08\x00\x00\x00\x00\x00')
    compressor = compressobj()
    for row in rows:
        ip, uid, date, url, answer, volume = row
        NCSA_ROW = '{} {} - [{}] "GET {} HTTP/1.0" {} {}\n'
        row = NCSA_ROW.format(ip, uid, date, url, answer, volume)
        row = row.encode('utf-8')
        data = compressor.compress(row)
        resp.write(data)
    resp.write(compressor.flush())
    return resp

The file I retrieve cannot be opened with gunzip, and zcat raises the following error:

gzip: out.gz: unexpected end of file

Solution

  • Your gzip header is wrong (8 bytes instead of 10), and you follow it with a zlib stream, which uses a different header and trailer. Even with a correct gzip header followed by a raw deflate stream instead of a zlib stream, you would still be missing the gzip trailer.

    To do this right, do not attempt to write your own gzip header. Instead request that zlib write a complete gzip stream, which will write the correct header, compressed data, and trailer. You can do this by providing a wbits value of 31 to compressobj().
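
As a minimal sketch of that approach, the generator below builds the gzip stream row by row with `zlib.compressobj(wbits=31)` and checks the result with `gzip.decompress`. The helper name `gzip_chunks` and the sample rows are made up for illustration; only the `wbits=31` setting comes from the answer above.

```python
import gzip
import zlib

NCSA_ROW = '{} {} - [{}] "GET {} HTTP/1.0" {} {}\n'


def gzip_chunks(rows):
    """Yield the pieces of one complete gzip stream, built row by row.

    Hypothetical helper for illustration, not part of the original code.
    """
    # wbits=31 (16 + 15) asks zlib to write the gzip header and trailer
    # itself, so no hand-written magic bytes are needed.
    compressor = zlib.compressobj(wbits=31)
    for row in rows:
        line = NCSA_ROW.format(*row).encode('utf-8')
        data = compressor.compress(line)
        if data:  # compress() may buffer its input and return b''
            yield data
    # flush() emits any buffered data followed by the gzip trailer
    # (CRC-32 and uncompressed length).
    yield compressor.flush()


# Made-up sample rows in the (ip, uid, date, url, answer, volume) layout.
rows = [
    ('127.0.0.1', 'alice', '10/Oct/2000:13:55:36 -0700', '/index.html', 200, 2326),
    ('127.0.0.1', 'bob', '10/Oct/2000:13:55:37 -0700', '/logo.png', 404, 0),
]

payload = b''.join(gzip_chunks(rows))
text = gzip.decompress(payload).decode('utf-8')
```

In the handler from the question, this would presumably mean dropping the manual `resp.write(b'\x1f\x8b...')` call and passing each chunk to `resp.write()` instead of joining them in memory.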