Python ungzipping stream of bytes?


Here is the situation:

  • I get gzipped XML documents from Amazon S3:

      import boto
      from boto.s3.connection import S3Connection
      from boto.s3.key import Key
      conn = S3Connection('access Id', 'secret access key')
      b = conn.get_bucket('mydev.myorg')
      k = Key(b)
      k.key = 'documents/document.xml.gz'  # key is an attribute, not a method
    
  • I read them into a file like this:

      import gzip
      f = open('/tmp/p', 'wb')  # binary mode: the payload is gzip data
      k.get_file(f)
      f.close()
      r = gzip.open('/tmp/p', 'rb')
      file_content = r.read()
      r.close()
    

Question

How can I ungzip the streams directly and read the contents?

I do not want to create temp files; they don't look good.


Solution

  • Yes, you can use the zlib module to decompress byte streams:

    import zlib
    
    def stream_gzip_decompress(stream):
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # +32: auto-detect a gzip or zlib header
        for chunk in stream:
            rv = dec.decompress(chunk)
            if rv:
                yield rv
        # flush any output still buffered in the decompressor
        rv = dec.flush()
        if rv:
            yield rv
    

    Adding 32 to the window size tells zlib to detect the header automatically, so the decompressor accepts both gzip and zlib streams and parses the gzip header itself.
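
To see what the window-size offset does, here is a small sketch that needs no S3 access. It assumes Python 3 (for gzip.compress), and the payload is made up for illustration. With 32 + zlib.MAX_WBITS both container formats decode; with 16 + zlib.MAX_WBITS only gzip is accepted.

```python
import gzip
import zlib

# Made-up payload, only for illustration.
payload = b"<doc>hello</doc>" * 100

gzip_bytes = gzip.compress(payload)  # gzip container
zlib_bytes = zlib.compress(payload)  # zlib container

# 32 + MAX_WBITS: the header type is detected automatically.
for blob in (gzip_bytes, zlib_bytes):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    assert dec.decompress(blob) + dec.flush() == payload

# 16 + MAX_WBITS: gzip only; a zlib stream is rejected with zlib.error.
dec = zlib.decompressobj(16 + zlib.MAX_WBITS)
assert dec.decompress(gzip_bytes) + dec.flush() == payload

try:
    zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(zlib_bytes)
    gzip_only_rejects_zlib = False
except zlib.error:
    gzip_only_rejects_zlib = True
```

If you know the input is always gzip, the stricter 16 offset may be preferable because it fails fast on anything else.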

    The S3 key object is an iterator, so you can do:

    for data in stream_gzip_decompress(k):
        # do something with the decompressed data
        pass
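
If you want to try the generator without touching S3, the following sketch feeds it an in-memory gzip payload split into small chunks. It assumes Python 3; fake_chunks is a hypothetical stand-in for the S3 key iterator, and the sample data is invented. The generator is repeated here only so the snippet runs on its own.

```python
import gzip
import zlib

def stream_gzip_decompress(stream):
    # 32 + MAX_WBITS: auto-detect a gzip or zlib header
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
    # emit whatever output is still buffered
    tail = dec.flush()
    if tail:
        yield tail

def fake_chunks(blob, size=16):
    """Hypothetical stand-in for the S3 key: yields the bytes in small pieces."""
    for i in range(0, len(blob), size):
        yield blob[i:i + size]

# Invented sample data, compressed in memory.
original = b"<doc>example</doc>\n" * 500
compressed = gzip.compress(original)

# Decompress chunk by chunk and reassemble, with no temp file involved.
restored = b"".join(stream_gzip_decompress(fake_chunks(compressed)))
assert restored == original
```

Joining the pieces is only done here to check the result; for large documents you would normally hand each decompressed piece to a parser as it arrives.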