Tags: python, gzip, deterministic

Gzip output different after Python restart


I'm trying to gzip a numpy array in Python 3.6.8.

If I run this snippet twice (different interpreter sessions), I get different output:

import gzip
import numpy
import base64

data = numpy.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]])
compressed = base64.standard_b64encode(gzip.compress(data.data, compresslevel=9))
print(compressed.decode('ascii'))

Example results (it's different every time):

H4sIAPjHiV4C/2NgAIEP9gwQ4AChOKC0AJQWgdISUFoGSitAaSUorQKl1aC0BpTWgtI6UFoPShs4AABmfqWAgAAAAA==
H4sIAPrHiV4C/2NgAIEP9gwQ4AChOKC0AJQWgdISUFoGSitAaSUorQKl1aC0BpTWgtI6UFoPShs4AABmfqWAgAAAAA==
      ^

Running it in a loop (so within the same interpreter session), it gives the same result each time:

for _ in range(1000):
    assert compressed == base64.standard_b64encode(gzip.compress(data.data, compresslevel=9))

How can I get the same result each time? (Preferably without external libraries.)


Solution

  • The gzip format stores some file metadata in its header, most notably a modification timestamp (good discussion of that here). You are not compressing a file as such, but gzip.compress still writes the current time, in whole seconds, into that header field, so two runs at different times produce different bytes. That would also explain why the loop passes: its iterations normally finish within the same second, so they all get the same timestamp.
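
To check that the timestamp really is the only difference, here is a minimal sketch that decodes the two outputs quoted in the question and compares them byte by byte. The helper name gzip_mtime is made up for illustration; the header layout it assumes (magic, method and flags in bytes 0 to 3, then a little-endian MTIME field in bytes 4 to 7) is the one described in the gzip file format specification, RFC 1952.

```python
import base64
import struct
from datetime import datetime, timezone

# The two outputs quoted in the question.
first = "H4sIAPjHiV4C/2NgAIEP9gwQ4AChOKC0AJQWgdISUFoGSitAaSUorQKl1aC0BpTWgtI6UFoPShs4AABmfqWAgAAAAA=="
second = "H4sIAPrHiV4C/2NgAIEP9gwQ4AChOKC0AJQWgdISUFoGSitAaSUorQKl1aC0BpTWgtI6UFoPShs4AABmfqWAgAAAAA=="


def gzip_mtime(b64_text):
    """Return the MTIME header field (seconds since the Unix epoch)."""
    raw = base64.standard_b64decode(b64_text)
    # Bytes 4-7 of a gzip stream hold MTIME as a little-endian unsigned int.
    (mtime,) = struct.unpack("<I", raw[4:8])
    return mtime


a = base64.standard_b64decode(first)
b = base64.standard_b64decode(second)

# Offsets at which the two compressed streams differ.
differing = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print("differing byte offsets:", differing)

# The embedded timestamps, shown as UTC datetimes.
for text in (first, second):
    print(datetime.fromtimestamp(gzip_mtime(text), tz=timezone.utc))
```

For these two strings the only differing byte is at offset 4, inside the MTIME field, and the two timestamps are two seconds apart. The compressed payload itself is identical.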

    So if you have Python 3.8+, pass mtime=0 to gzip.compress:

    gzip.compress(data.data, compresslevel=9, mtime=0)
    

    If that is not an option (e.g. on an older Python version such as 3.6.8, where gzip.compress does not accept mtime), you can use gzip.GzipFile with the mtime parameter instead:

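    import gzip
    import io

    buf = io.BytesIO()
    # mtime=0 pins the header timestamp, so the output no longer depends on the clock
    with gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=9, mtime=0) as f:
        f.write(data.data)
    result = buf.getvalue()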
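
As a rough, self-contained sketch of the same idea, the GzipFile approach can be wrapped in a small helper and checked for repeatability. The name deterministic_gzip is hypothetical, and the payload is built with struct rather than numpy purely so the example runs without external libraries; it is assumed to stand in for the 16 float64 values from the question.

```python
import gzip
import io
import struct


def deterministic_gzip(payload, compresslevel=9):
    """Gzip `payload` with the header timestamp fixed to 0 (hypothetical helper)."""
    buf = io.BytesIO()
    # mtime=0 replaces the default "current time" in the gzip header.
    with gzip.GzipFile(fileobj=buf, mode="wb",
                       compresslevel=compresslevel, mtime=0) as f:
        f.write(payload)
    return buf.getvalue()


# Stand-in for the array in the question: 1.0 .. 16.0 as little-endian float64.
payload = struct.pack("<16d", *[float(i) for i in range(1, 17)])

first = deterministic_gzip(payload)
second = deterministic_gzip(payload)

print(first == second)                    # same bytes on every call
print(first[4:8])                         # the MTIME field is all zero bytes
print(gzip.decompress(first) == payload)  # data still round-trips
```

This only pins the timestamp. The compressed bytes may still differ between machines if the underlying zlib version or the compression level differs, so compare outputs only across environments you control.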
    

    For details, the documentation is here: