My original code does this:
$data = file_get_contents($source);
$compressed_data = gzencode($data);
file_put_contents($destination, $compressed_data);
This works fine, and it seemingly supports a lot of different values for $source and $destination, including in-memory file systems, stdin/stdout streams, etc.
However, large files have to be fully loaded into memory, so I'd like to switch this over to a chunked approach.
I've tried the following:
$in = fopen($source, 'rb');
$out = gzopen($destination, 'wb');
while (!feof($in)) {
    gzwrite($out, fread($in, 4096));
}
But this gives me an error with stream wrappers (such as https://packagist.org/packages/mikey179/vfsstream): gzopen(): cannot represent a stream of type user-space as a File Descriptor.
I also tried the following:
$in = fopen($source, 'rb');
$out = fopen($destination, 'wb');
stream_filter_append($out, 'zlib.deflate', STREAM_FILTER_WRITE, -1);
while (!feof($in)) {
    fwrite($out, fread($in, 4096));
}
But the resulting output doesn't appear to be valid GZIP (missing header maybe?).
Finally I tried this:
$in = fopen($source, 'rb');
$out = fopen('compress.zlib://' . $destination, 'wb');
while (!feof($in)) {
    fwrite($out, fread($in, 4096));
}
But (unsurprisingly) this failed if $destination already had a wrapper (such as php://stdout or the vfs:// mentioned above).
There has to be a way to do this, but searching hasn't turned up any examples for it.
I have now reimplemented the specification for the GZip header and footer, which are the only things missing when using stream_filter_append() with zlib.deflate (second solution above).
The (minimal) header consists of ten bytes as defined by https://www.rfc-editor.org/rfc/rfc1952#page-6:
1F 8B // magic number identifying the gzip format
08 // compression method (08 = deflate)
00 // flags (FLG), none set
00 00 00 00 // four bytes for the file's mtime, zero if inapplicable or after 2038
00 // extra flags (XFL)
03 // operating system (03 = Unix, which covers Linux)
The footer consists of eight bytes: four bytes for the CRC32 checksum of the uncompressed payload, and four bytes for the byte length of the uncompressed payload (modulo 2^32), both stored little-endian.
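As a rough sketch, the header and footer described above can be built with pack(). The helper names here are hypothetical (they are not part of PHP or any library), and the usage at the end takes the shortcut of compressing a small string in memory with gzdeflate(), which produces a raw deflate stream, purely to show that the three pieces fit together.

```php
<?php
// Hypothetical helpers, for illustration only.

/** Minimal 10-byte gzip header: no timestamp, no optional fields. */
function gzip_header_sketch(): string
{
    return "\x1f\x8b"     // magic number
         . "\x08"         // compression method: deflate
         . "\x00"         // FLG: no optional fields
         . pack('V', 0)   // MTIME: 0 = no timestamp available
         . "\x00"         // XFL: no extra flags
         . "\x03";        // OS: Unix
}

/** 8-byte footer: CRC32 and length of the uncompressed data, little-endian. */
function gzip_footer_sketch(int $crc32, int $length): string
{
    // 'V' packs an unsigned 32-bit little-endian integer.
    return pack('V', $crc32) . pack('V', $length & 0xFFFFFFFF);
}

// Usage: wrap a raw deflate stream in the header and footer.
$payload = str_repeat('hello gzip ', 100);
$gz = gzip_header_sketch()
    . gzdeflate($payload)
    . gzip_footer_sketch(crc32($payload), strlen($payload));
```

If the pieces are right, gzdecode($gz) should give back $payload.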
The CRC32 presents a further problem here, because PHP's crc32() function takes the whole payload as a single string, which means loading it into memory, which we are trying to avoid.
I instead reimplemented Mark Adler's crc32_combine algorithm, which uses the CRC32 checksums of two strings (and the second string's length) to calculate the CRC32 checksum of their concatenation: https://github.com/madler/zlib/blob/v1.2.11/crc32.c#L372
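A PHP port of that routine might look roughly like the sketch below. It follows the GF(2) matrix approach of the linked zlib source; the function names are hypothetical, and it assumes 64-bit PHP, where crc32() returns non-negative integers.

```php
<?php
// Hypothetical port of zlib's crc32_combine; assumes 64-bit PHP integers.

/** Multiply a 32x32 GF(2) matrix (array of 32 ints) by a 32-bit vector. */
function gf2_matrix_times_sketch(array $mat, int $vec): int
{
    $sum = 0;
    for ($i = 0; $vec !== 0; $vec >>= 1, $i++) {
        if ($vec & 1) {
            $sum ^= $mat[$i];
        }
    }
    return $sum;
}

/** Square a GF(2) matrix. */
function gf2_matrix_square_sketch(array $mat): array
{
    $square = [];
    for ($n = 0; $n < 32; $n++) {
        $square[$n] = gf2_matrix_times_sketch($mat, $mat[$n]);
    }
    return $square;
}

/** CRC32 of A.B, given crc32(A), crc32(B) and strlen(B). */
function crc32_combine_sketch(int $crc1, int $crc2, int $len2): int
{
    if ($len2 <= 0) {
        return $crc1;
    }
    // Operator for a single zero bit: the reflected CRC-32 polynomial.
    $odd = [0xedb88320];
    $row = 1;
    for ($n = 1; $n < 32; $n++) {
        $odd[$n] = $row;
        $row <<= 1;
    }
    $even = gf2_matrix_square_sketch($odd);  // operator for two zero bits
    $odd = gf2_matrix_square_sketch($even);  // operator for four zero bits

    // Apply $len2 zero bytes to $crc1, squaring the operator at each step.
    do {
        $even = gf2_matrix_square_sketch($odd);
        if ($len2 & 1) {
            $crc1 = gf2_matrix_times_sketch($even, $crc1);
        }
        $len2 >>= 1;
        if ($len2 === 0) {
            break;
        }
        $odd = gf2_matrix_square_sketch($even);
        if ($len2 & 1) {
            $crc1 = gf2_matrix_times_sketch($odd, $crc1);
        }
        $len2 >>= 1;
    } while ($len2 !== 0);

    return $crc1 ^ $crc2;
}

// Usage: keep a running CRC32 over chunks, starting from 0.
$whole = str_repeat('abcdefg', 1000);
$running = 0;
foreach (str_split($whole, 4096) as $chunk) {
    $running = crc32_combine_sketch($running, crc32($chunk), strlen($chunk));
}
```

After the loop, $running should equal crc32($whole).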
This allows updating the CRC32 as each chunk is loaded and compressed.
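Putting the pieces together, the chunked loop could look roughly like this sketch. To keep it short it does not use a crc32_combine port; it swaps in the hash extension's incremental crc32b context (hash_init(), hash_update(), hash_final()), which as far as I can tell yields the same checksum as crc32(). The function name gzip_copy_sketch is hypothetical, $in and $out are assumed to be already-opened streams, and 64-bit PHP is assumed.

```php
<?php
// Hypothetical helper: gzip everything readable from $in into $out, chunk by chunk.
function gzip_copy_sketch($in, $out, int $chunkSize = 4096): void
{
    // Minimal header, written before the filter is attached so it stays uncompressed:
    // magic, deflate, no flags, no mtime, no extra flags, OS = Unix.
    fwrite($out, "\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03");

    // From here on, everything written to $out passes through raw deflate.
    $filter = stream_filter_append($out, 'zlib.deflate', STREAM_FILTER_WRITE, -1);

    $crc = hash_init('crc32b'); // incremental CRC32 of the uncompressed data
    $length = 0;                // uncompressed byte count

    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break;
        }
        hash_update($crc, $chunk);
        $length += strlen($chunk);
        fwrite($out, $chunk);
    }

    // Removing the filter flushes the remaining compressed data to $out,
    // so the footer that follows is written uncompressed.
    stream_filter_remove($filter);

    // Footer: CRC32 and length modulo 2^32, both little-endian.
    fwrite($out, pack('V', hexdec(hash_final($crc))) . pack('V', $length & 0xFFFFFFFF));
}

// Usage with in-memory streams:
$data = str_repeat('chunked gzip test ', 2000);
$in = fopen('php://memory', 'w+b');
fwrite($in, $data);
rewind($in);
$out = fopen('php://memory', 'w+b');
gzip_copy_sketch($in, $out);
rewind($out);
$gz = stream_get_contents($out);
```

Because the function only takes stream resources, the caller decides how $source and $destination are opened, wrappers included.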