Writing a chunked gzip file to an arbitrary output stream in PHP


My original code does this:

    $data = file_get_contents($source);
    $compressed_data = gzencode($data);
    file_put_contents($destination, $compressed_data);

This works fine, and it seemingly supports a lot of different values for $source and $destination - including in-memory file systems, stdin/stdout streams, etc.

However, large files have to be fully loaded into memory, so I'd like to switch this over to a chunked approach.

I've tried the following:

    $in = fopen($source, 'rb');
    $out = gzopen($destination, 'wb');
    while (!feof($in)) {
        gzwrite($out, fread($in, 4096));
    }

But this gives me an error with stream wrappers (such as https://packagist.org/packages/mikey179/vfsstream): gzopen(): cannot represent a stream of type user-space as a File Descriptor.

Also tried the following:

    $in = fopen($source, 'rb');
    $out = fopen($destination, 'wb');
    stream_filter_append($out, 'zlib.deflate', STREAM_FILTER_WRITE, -1);
    while (!feof($in)) {
        fwrite($out, fread($in, 4096));
    }

But the resulting output doesn't appear to be valid GZIP (missing header maybe?)
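
A quick check along these lines suggests the filter writes a raw DEFLATE stream with no gzip header or trailer. This is only a sketch: it assumes the zlib extension is loaded, uses php://memory as a stand-in for $destination, and the variable names are made up for illustration.

```php
<?php
// Sketch: inspect what the zlib.deflate write filter actually emits.
$payload = str_repeat("hello gzip\n", 1000);

$out = fopen('php://memory', 'w+b');
$filter = stream_filter_append($out, 'zlib.deflate', STREAM_FILTER_WRITE, -1);
fwrite($out, $payload);
// Removing the filter flushes whatever it still holds in its buffer.
stream_filter_remove($filter);

rewind($out);
$raw = stream_get_contents($out);

// A gzip file starts with the magic bytes 1F 8B.
$hasGzipMagic = substr($raw, 0, 2) === "\x1f\x8b";
// gzinflate() decodes raw DEFLATE data (no header, no trailer).
$isRawDeflate = gzinflate($raw) === $payload;

var_dump($hasGzipMagic, $isRawDeflate);
```

If the magic bytes are absent while gzinflate() round-trips the data, the missing pieces are the gzip header and footer rather than the compression itself.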

Finally I tried this:

    $in = fopen($source, 'rb');
    $out = fopen('compress.zlib://' . $destination, 'wb');
    while (!feof($in)) {
        fwrite($out, fread($in, 4096));
    }

But (unsurprisingly) this failed if $destination already had a wrapper (such as php://stdout or the vfs:// mentioned above).

There has to be a way to do this, but searching hasn't turned up any examples for it.


Solution

  • I have now implemented the GZip header and footer myself according to the specification. They are the only parts missing when using stream_filter_append() with zlib.deflate (the second attempt above), which writes a raw DEFLATE stream.

    The (minimal) header consists of ten bytes as defined by https://www.rfc-editor.org/rfc/rfc1952#page-6:

    1F 8B       // ID1, ID2: gzip magic number
    08          // CM: deflate compression
    00          // FLG: flags (no optional fields)
    00 00 00 00 // MTIME: four bytes (little-endian) for the file's mtime, zero if inapplicable or after 2038
    00          // XFL: extra flags
    03          // OS: operating system (03 for Unix)
    

    The footer consists of eight bytes: four bytes for the CRC32 checksum of the uncompressed payload, followed by four bytes for the byte length of the uncompressed payload (modulo 2^32). Both values are stored little-endian.

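    The CRC32 presents a further problem here: PHP's crc32() function only accepts a complete string, so using it directly would mean loading the entire payload into memory, which we are trying to avoid. (The hash extension's hash_init('crc32b'), hash_update() and hash_final() can also compute this checksum incrementally.)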

    I instead reimplemented Mark Adler's crc32_combine algorithm, which uses the CRC32 checksums of two strings (and the second string's length) to calculate the CRC32 checksum of their concatenation: https://github.com/madler/zlib/blob/v1.2.11/crc32.c#L372. This allows updating the CRC32 as each chunk is loaded and compressed.
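
The pieces described above could fit together roughly as follows. This is a minimal sketch, not the original implementation: it assumes 64-bit PHP with the zlib extension loaded, and the function names (gf2_matrix_times, gf2_matrix_square, crc32_combine, gzip_copy) are hypothetical, modelled on the zlib C source rather than on anything PHP provides. It also relies on stream_filter_remove() flushing the deflate filter before the footer is written.

```php
<?php
// Sketch only: assumes 64-bit PHP; all function names here are hypothetical.

/** Multiply a 32x32 GF(2) matrix (array of 32 ints) by a 32-bit vector. */
function gf2_matrix_times(array $mat, int $vec): int
{
    $sum = 0;
    for ($i = 0; $vec !== 0; $i++, $vec >>= 1) {
        if ($vec & 1) {
            $sum ^= $mat[$i];
        }
    }
    return $sum;
}

/** Square a GF(2) matrix. */
function gf2_matrix_square(array $mat): array
{
    $square = [];
    foreach ($mat as $n => $row) {
        $square[$n] = gf2_matrix_times($mat, $row);
    }
    return $square;
}

/** CRC32 of A.B from crc32(A), crc32(B) and strlen(B), after zlib's crc32_combine. */
function crc32_combine(int $crc1, int $crc2, int $len2): int
{
    if ($len2 <= 0) {
        return $crc1;
    }
    // Operator for one zero bit: the CRC-32 polynomial, then shifted identity rows.
    $odd = [0xedb88320];
    for ($n = 1, $row = 1; $n < 32; $n++, $row <<= 1) {
        $odd[$n] = $row;
    }
    $even = gf2_matrix_square($odd);  // operator for two zero bits
    $odd = gf2_matrix_square($even);  // operator for four zero bits
    // Apply $len2 zero bytes to $crc1, one bit of $len2 at a time.
    do {
        $even = gf2_matrix_square($odd);
        if ($len2 & 1) {
            $crc1 = gf2_matrix_times($even, $crc1);
        }
        $len2 >>= 1;
        if ($len2 === 0) {
            break;
        }
        $odd = gf2_matrix_square($even);
        if ($len2 & 1) {
            $crc1 = gf2_matrix_times($odd, $crc1);
        }
        $len2 >>= 1;
    } while ($len2 !== 0);
    return $crc1 ^ $crc2;
}

/** Copy $in to $out as a gzip stream, one chunk at a time. */
function gzip_copy($in, $out, int $chunkSize = 8192): void
{
    // Header: magic, CM=deflate, FLG=0, MTIME=0, XFL=0, OS=Unix.
    fwrite($out, "\x1f\x8b\x08\x00" . pack('V', 0) . "\x00\x03");
    $filter = stream_filter_append($out, 'zlib.deflate', STREAM_FILTER_WRITE, -1);
    $crc = 0;
    $length = 0;
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break;
        }
        if ($chunk === '') {
            continue;
        }
        // Fold this chunk's checksum into the running checksum.
        $crc = crc32_combine($crc, crc32($chunk), strlen($chunk));
        $length += strlen($chunk);
        fwrite($out, $chunk);
    }
    // Removing the filter flushes the remaining compressed data.
    stream_filter_remove($filter);
    // Footer: CRC32 and length modulo 2^32, both little-endian.
    fwrite($out, pack('V', $crc) . pack('V', $length & 0xffffffff));
}
```

With the variables from the question, usage would be along the lines of gzip_copy(fopen($source, 'rb'), fopen($destination, 'wb'), 4096). Because the header and footer are written with plain fwrite() calls and the compression runs as a stream filter, nothing here needs a file descriptor, so user-space wrappers should work as destinations.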