I am attempting to combine multiple gzip streams into a single stream. My understanding is that this should be possible, but my implementation is flawed.

Based on what I have read, my expectation was that I could remove the 10-byte header and 8-byte footer from each stream, concatenate the remaining bytes, and reconstruct a single header and footer. However, when I try this, decompression fails. I am assuming this is because .flush() includes some "end of data" information in the final block that is not being removed.
It is possible to concatenate multiple gzip streams together without altering them; the result is a valid gzip file containing multiple streams. Unfortunately, when using zlib.decompress(data, GZIP_WBITS), rather than using a decompressobj and checking for an unconsumed_tail, only the first stream is returned.
Here is an example showing how plain concatenation can break some downstream clients consuming these files:
```python
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS

def decompress(data: bytes) -> bytes:
    return zlib.decompress(data, GZIP_WBITS)

def compress(data: list[bytes]) -> bytes:
    output = b""
    for datum in data:
        deflate = zlib.compressobj(8, zlib.DEFLATED, GZIP_WBITS)
        output += deflate.compress(datum)
        output += deflate.flush()
    return output

def test_decompression():
    data = [b"Hello", b"World!"]
    compressed = compress(data)
    decompressed = decompress(compressed)
    # this should be b"".join(data) == decompressed,
    # but only the first stream comes back
    assert decompressed == data[0]
```
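For contrast, the full concatenation can be recovered with a decompressobj loop. Note that the bytes following the end of one gzip member land in unused_data (unconsumed_tail only holds input withheld by a max_length limit). A minimal sketch, with decompress_all being my own hypothetical name:

```python
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS

def decompress_all(data: bytes) -> bytes:
    # Decompress one gzip member per iteration; any bytes following
    # that member's trailer are left in unused_data for the next pass.
    output = b""
    while data:
        d = zlib.decompressobj(GZIP_WBITS)
        output += d.decompress(data)
        output += d.flush()
        data = d.unused_data
    return output
```

This recovers every member, but of course it requires changing the client code, which is the constraint here.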
```python
import struct
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS

test_bytes = b"hello world"

# Create an example gzip stream
deflate = zlib.compressobj(8, zlib.DEFLATED, GZIP_WBITS)
single = deflate.compress(test_bytes)
single += deflate.flush()

# Quick sanity check that decompression works
zlib.decompress(single, GZIP_WBITS)
print("Single:", single.hex())

# Check our understanding of the footer is correct
single_len = struct.unpack("<I", single[-4:])[0]
assert single_len == len(test_bytes), "wrong len"
single_crc = struct.unpack("<I", single[-8:-4])[0]
assert single_crc == zlib.crc32(test_bytes), "wrong crc"

# Create an example gzip stream with duplicated input bytes
deflate = zlib.compressobj(8, zlib.DEFLATED, GZIP_WBITS)
double = deflate.compress(test_bytes)
double += deflate.compress(test_bytes)
double += deflate.flush()

# Quick sanity check that decompression works
zlib.decompress(double, GZIP_WBITS)

# Check we can calculate the length and CRC correctly
double_length = struct.unpack("<I", double[-4:])[0]
assert double_length == len(test_bytes + test_bytes), "wrong len"
double_crc = struct.unpack("<I", double[-8:-4])[0]
assert double_crc == zlib.crc32(test_bytes + test_bytes), "wrong crc"
print(f"Double: {double.hex()}")

# Remove the header and footer from our original gzip stream
single_data = single[10:-8]
print(f" Data: {' ' * 20}{single_data.hex()}")

# Concatenate the original stream (footer removed) with a duplicate
# of its payload (header and footer removed)
concatenated = single[:-8] + single_data

# Add the footer, comprising the CRC and length
concatenated += struct.pack("<I", double_crc)
concatenated += struct.pack("<I", double_length)
assert concatenated.startswith(single[:-8])
print(f" Maybe: {concatenated.hex()}")

# Confirm this is bad data -- raises zlib.error
zlib.decompress(concatenated, GZIP_WBITS)
```
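That failure is expected: flush() defaults to Z_FINISH, so the deflate data ends with a block whose BFINAL bit is set, and the decoder stops reading there; a second payload spliced in after it is never consumed. A sketch of inspecting that bit (my own illustration; it assumes the tiny input produces a single deflate block, so bit 0 of the first deflate byte is BFINAL):

```python
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS

deflate = zlib.compressobj(8, zlib.DEFLATED, GZIP_WBITS)
stream = deflate.compress(b"hello world") + deflate.flush()

# Skip the 10-byte gzip header; the low bit of the first deflate byte
# is BFINAL. flush(Z_FINISH) set it, so decoding halts after this block.
assert stream[10] & 1 == 1
```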
My assumption is it will be possible to use the following function to combine the crc32 values:

```python
def crc_combine(crcA, crcB, lenB):
    crcA0 = zlib.crc32(b'\0' * lenB, crcA ^ 0xffffffff) ^ 0xffffffff
    return crcA0 ^ crcB
```
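This combination rule can be checked directly against a CRC computed over the concatenated bytes (a self-contained check with my own sample data): extending crcA over lenB zero bytes, with the pre/post-conditioning stripped via the 0xffffffff XORs, and then folding in crcB yields the CRC of the joined input.

```python
import zlib

def crc_combine(crcA, crcB, lenB):
    # Extend crcA through lenB zero bytes, then XOR in crcB
    crcA0 = zlib.crc32(b"\0" * lenB, crcA ^ 0xFFFFFFFF) ^ 0xFFFFFFFF
    return crcA0 ^ crcB

a, b = b"hello ", b"world"
combined = crc_combine(zlib.crc32(a), zlib.crc32(b), len(b))
assert combined == zlib.crc32(a + b)
```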
This mirrors zlib's crc32_combine function. Unfortunately, the downstream clients cannot be moved off zlib.decompress(data, GZIP_WBITS), as the resultant files form part of a "public interface" and this would be considered a breaking change.

This is much easier than you're making it out to be. Simply concatenate the gzip files without removing or in any way messing with the headers and trailers. Any concatenation of gzip streams is a valid gzip stream, and will decompress to the concatenation of the uncompressed contents of the individual gzip streams.
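To illustrate that claim (my own sketch, not from the thread): Python's gzip module reads all members of a concatenated file, so clients that can use gzip.decompress rather than zlib.decompress already get the full contents back.

```python
import gzip

part1 = gzip.compress(b"Hello")
part2 = gzip.compress(b"World!")

# A concatenation of gzip streams is itself a valid gzip file, and
# gzip.decompress reads every member, not just the first.
assert gzip.decompress(part1 + part2) == b"HelloWorld!"
```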