python, python-3.x, gzip, zlib, deflate

Deflate string with gzip or zlib in Python - why am I missing the "H4sIAAAAAAAA/" bit?


I am trying to compress a string in Python, but my result is not what I expected.

For example, the string I am trying to compress is:

<?xml version='1.0' encoding='UTF-8'?>

Here is what my end result should be:

H4sIAAAAAAAA/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA==

First try:

base64.b64encode(gzip.compress("<?xml version='1.0' encoding='UTF-8'?>".encode('utf-8')))

Results in:

b'H4sIAHDj6lsC/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA=='

The result is almost what I am looking for, but the header part is different. Both results (mine and the expected one) decompress to the same string, so they both seem to work. I would still like to know why I am not getting the expected header in the base64-encoded compressed string.

Could I maybe get a better result using zlib? I tried, but got a completely different result, which also decompressed correctly.


Solution

  • You have exactly the same compressed data stream. The only differences are in the gzip header: your expected output has the MTIME field set to 0 rather than a timestamp, and the XFL field set to 0 rather than 2:

    >>> from base64 import b64decode
    >>> expected = b64decode('H4sIAAAAAAAA/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA==')
    >>> actual = b64decode('H4sIAHDj6lsC/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA==')
    >>> expected[:4] == actual[:4]  # identification, compression method and flag match
    True
    >>> expected[4:8], actual[4:8]  # mtime bytes differ, zero vs. current time
    (b'\x00\x00\x00\x00', b'p\xe3\xea[')
    >>> from datetime import datetime
    >>> print(datetime.fromtimestamp(int.from_bytes(actual[4:8], 'little')))
    2018-11-13 14:45:04
    >>> expected[8], actual[8]  # XFL is set to 2 in the actual output
    (0, 2)
    >>> expected[9], actual[9]  # OS set to *unknown* in both
    (255, 255)
    >>> expected[10:] == actual[10:]  # compressed data payload is the same
    True
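
The byte-by-byte comparison above can also be written as a small parser for the fixed 10-byte gzip header. This is only a sketch: the names GzipHeader and parse_gzip_header are made up for illustration, and it assumes the stream starts with the standard fixed header (ID1, ID2, CM, FLG, MTIME, XFL, OS), ignoring any optional fields.

```python
import struct
from base64 import b64decode
from typing import NamedTuple


class GzipHeader(NamedTuple):
    # Hypothetical container for the fixed header fields
    magic: bytes   # ID1, ID2: b'\x1f\x8b' for gzip
    method: int    # CM: 8 means DEFLATE
    flags: int     # FLG: bits announcing optional header fields
    mtime: int     # MTIME: Unix timestamp, 0 means "not set"
    xfl: int       # XFL: extra flags, a hint about the compression level
    os: int        # OS: 255 means "unknown"


def parse_gzip_header(data: bytes) -> GzipHeader:
    """Unpack the 10 fixed header bytes of a gzip stream (little-endian)."""
    if len(data) < 10 or data[:2] != b'\x1f\x8b':
        raise ValueError('not a gzip stream')
    return GzipHeader(*struct.unpack('<2sBBIBB', data[:10]))


expected = parse_gzip_header(b64decode(
    'H4sIAAAAAAAA/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA=='))
actual = parse_gzip_header(b64decode(
    'H4sIAHDj6lsC/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA=='))
print(expected)
print(actual)
```

Printing both tuples side by side makes it easy to see that only the mtime and xfl fields differ.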
    

    The gzip.compress() function just uses the gzip.GzipFile() class to do the actual compressing, and that class uses time.time() for the MTIME field whenever the mtime argument is left at its default of None.

    I wouldn't expect that to matter in practice: both strings decompress to exactly the same data.

    If you must have the same output, then the easiest method is to just replace the header:

    compressed = gzip.compress("<?xml version='1.0' encoding='UTF-8'?>".encode('utf-8'))
    result = base64.b64encode(b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff' + compressed[10:])
    

    The above replaces the existing header with one that has the parts that matter set to the same values as your expected output: both MTIME and the XFL field set to 0. Note that with gzip.compress(), only the MTIME bytes would ever vary between runs, and the XFL field is not actually used when decompressing.
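
If you want the header swap to be a bit more defensive, it could be wrapped in a helper along these lines. This is a sketch, not part of the gzip module: with_fixed_header and TARGET_HEADER are names invented here, and the function assumes the input has no optional header fields (FLG is 0), which is the case for the output shown in the question.

```python
import base64

# Fixed 10-byte header with MTIME=0, XFL=0 and OS=255, as in the expected output.
TARGET_HEADER = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff'


def with_fixed_header(compressed: bytes) -> bytes:
    """Return the gzip stream with its first 10 bytes replaced by TARGET_HEADER."""
    if len(compressed) < 10 or compressed[:3] != b'\x1f\x8b\x08':
        raise ValueError('not a gzip/DEFLATE stream')
    if compressed[3] != 0:
        # Optional fields (file name, comment, ...) would follow the fixed
        # header; this sketch deliberately does not handle them.
        raise ValueError('unexpected FLG bits set')
    return TARGET_HEADER + compressed[10:]


# The output from the question, with a timestamp in MTIME and XFL set to 2.
actual = base64.b64decode(
    'H4sIAHDj6lsC/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA==')
result = base64.b64encode(with_fixed_header(actual)).decode('ascii')
print(result)
```

Applied to the output from the question, this should print the expected string, since the two streams only differ in those header bytes.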

    While you could use the gzip.GzipFile() class to produce compressed output with MTIME set to 0 (pass in mtime=0), you can't change what the XFL field is set to; that is currently hard-coded to 2.
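
For reference, here is a minimal sketch of the GzipFile() route, writing into an in-memory buffer. The helper name gzip_without_timestamp is made up for this example. It only zeroes the MTIME field; the XFL byte is left to whatever your Python version writes. (As far as I know, Python 3.8 and later also accept an mtime argument directly on gzip.compress().)

```python
import gzip
import io


def gzip_without_timestamp(data: bytes) -> bytes:
    """Compress data with gzip.GzipFile, writing 0 into the MTIME field."""
    buffer = io.BytesIO()
    # mtime=0 replaces the default time.time() value in the header.
    with gzip.GzipFile(fileobj=buffer, mode='wb', mtime=0) as f:
        f.write(data)
    return buffer.getvalue()


xml = "<?xml version='1.0' encoding='UTF-8'?>".encode('utf-8')
compressed = gzip_without_timestamp(xml)
print(compressed[4:8])  # the four MTIME bytes
print(compressed[8])    # the XFL byte, not controllable through this API
```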

    Note that even accounting for the MTIME and XFL differences, identical data compressed with different implementations of the DEFLATE compression algorithm could still result in a different compressed stream, even when using the same compression settings! That's because DEFLATE encodes data based on the frequency of snippets, and different implementations are free to make different choices when there are multiple snippets with the same frequency available when compressing. So the only correct way to test whether your data has been compressed correctly is to decompress it again and compare the result.
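
A round-trip check along those lines could look like the sketch below. The helper name decompress_b64 is invented for this example, and the two base64 strings are the ones from the question. The comparison is done on the decompressed bytes, never on the compressed output.

```python
import base64
import gzip


def decompress_b64(encoded: str) -> bytes:
    """Decode a base64 string and gunzip the result."""
    return gzip.decompress(base64.b64decode(encoded))


original = "<?xml version='1.0' encoding='UTF-8'?>".encode('utf-8')
expected_b64 = 'H4sIAAAAAAAA/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA=='
actual_b64 = 'H4sIAHDj6lsC/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA=='

# The encoded strings differ because of the header bytes ...
strings_equal = expected_b64 == actual_b64
# ... but both decompress to the original data, which is what matters.
payloads_equal = decompress_b64(expected_b64) == decompress_b64(actual_b64) == original
print(strings_equal, payloads_equal)

# The same check works for freshly compressed data at any compression level.
for level in (1, 6, 9):
    assert gzip.decompress(gzip.compress(original, compresslevel=level)) == original
```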