Tags: python, compression, gzip

Can gzip compress data without loading it all into memory, i.e. streaming/on-the-fly?


Is it possible to gzip data via some amount of streaming, i.e. without loading all of the compressed data in memory at once?

For example, can I gzip a file whose compressed output will be 10 GB, on a machine with 2 GB of memory?

At https://docs.python.org/3/library/gzip.html#gzip.compress, the gzip.compress function returns the gzipped data as a single bytes object, so the whole result must be in memory at once. But it's not clear how gzip.open works internally: whether the compressed bytes will all be in memory at once. Does the gzip format itself make it particularly tricky to achieve a streaming gzip?

[This question is tagged with Python, but non-Python answers welcome as well]


Solution

  • [This is based on @Barmar's answer and comments]

    You can achieve streaming gzip compression. The gzip module uses zlib, which is documented to support streaming compression, and peeking into the gzip module source, it doesn't appear to load all of the output bytes into memory.
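    For instance, a minimal sketch of chunked writing through gzip.open (the temporary file path and 64 KiB chunk size here are arbitrary choices for illustration): each chunk is compressed and written to disk as it arrives, so memory use stays roughly one chunk regardless of the total output size.

    ```python
    import gzip
    import os
    import tempfile

    chunk = b'*' * 65536
    path = os.path.join(tempfile.mkdtemp(), 'example.gz')

    # Write 100 chunks of 64 KiB through gzip.open: each write is
    # compressed incrementally rather than buffered in full.
    with gzip.open(path, 'wb') as f:
        for _ in range(100):
            f.write(chunk)

    # Reading back is also chunked: loop over fixed-size reads
    # instead of calling read() once for the whole file.
    total = 0
    with gzip.open(path, 'rb') as f:
        while block := f.read(65536):
            total += len(block)

    print(total)  # 6553600
    ```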

    You can also do this directly with the zlib module, for example with a small pipeline of generators:

    import zlib
    
    def yield_uncompressed_bytes():
        # In a real case, would yield bytes pulled from the filesystem or the network
        chunk = b'*' * 65000
        for _ in range(10000):
            print('In: ', len(chunk))
            yield chunk
    
    def yield_compressed_bytes(_uncompressed_bytes):
        # wbits=zlib.MAX_WBITS + 16 tells zlib to wrap the deflate stream
        # in a gzip header and trailer
        compress_obj = zlib.compressobj(wbits=zlib.MAX_WBITS + 16)
        for chunk in _uncompressed_bytes:
            if compressed_bytes := compress_obj.compress(chunk):
                yield compressed_bytes
    
        if compressed_bytes := compress_obj.flush():
            yield compressed_bytes
    
    uncompressed_bytes = yield_uncompressed_bytes()
    compressed_bytes = yield_compressed_bytes(uncompressed_bytes)
    
    for chunk in compressed_bytes:
        # In a real case, could save to the filesystem, or send over the network
        print('Out:', len(chunk))
    

    You can see that the `In:` lines are interspersed with the `Out:` lines, suggesting that the zlib compressobj is indeed not storing all of the output in memory.
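    The reverse direction works the same way: zlib.decompressobj with the same wbits value decompresses the gzip stream chunk by chunk, holding only its internal window in memory. A round-trip sketch (the sample data and chunk size are made up for illustration):

    ```python
    import zlib

    data = b'example data ' * 10000

    # Compress in 64 KiB chunks, as in the answer above.
    comp = zlib.compressobj(wbits=zlib.MAX_WBITS + 16)
    compressed_chunks = [
        comp.compress(data[i:i + 65536])
        for i in range(0, len(data), 65536)
    ]
    compressed_chunks.append(comp.flush())

    # Decompress chunk by chunk: decompressobj never needs the
    # whole compressed input or the whole output at once.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS + 16)
    out = bytearray()
    for chunk in compressed_chunks:
        out += decomp.decompress(chunk)
    out += decomp.flush()

    print(out == data)  # True
    ```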