Tags: php, gzip, zlib, php-stream-wrappers

PHP: create gz stream from plain file


I want to create a read stream that gzip-encodes a plain text file as it is read.

The Google Cloud Storage library has an upload function that accepts a StreamInterface as a parameter (see the Bucket::upload reference).

I want to upload a .txt file, but gzip-encoded.

Uploading a plain txt file is simple:

/** @var \Google\Cloud\Storage\Bucket $bucket */
$fd = fopen('/tmp/file.txt', 'r');
$stream = \GuzzleHttp\Psr7\Utils::streamFor($fd);
$bucket->upload($stream, ['name' => 'file.txt']);

I want to create a stream that:

  • reads the original plain txt file
  • applies gzencode to every chunk

It should not store the full file in memory (only the chunks) or on disk. Is this possible?

I think it should be something like the following code, but producing a gz file (instead of just applying zlib.deflate to the data):

$fd = fopen('/tmp/file.txt', 'r');
stream_filter_append($fd, 'zlib.deflate', STREAM_FILTER_READ, ['window' => 15]);
$stream = \GuzzleHttp\Psr7\Utils::streamFor($fd);
$bucket->upload($stream, ['name' => 'file.txt.gz']);

Thanks!


Solution

  • I got a bit nerd-sniped and had to write something for this.

    Reposting my above comment:

    While DEFLATE is the algorithm used by gzip, it is not the format. This is laid out in the response to bugs.php.net/bug.php?id=68556. This stream filter appears to use the DEFLATE format header and trailer, and there does not currently seem to be a built-in gzip stream filter.
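
    To make the algorithm-versus-format distinction concrete, here is a minimal sketch using PHP's one-shot zlib functions (not the stream filter itself). The variable names are only illustrative. It compresses the same string three ways and prints the first two bytes of each result, so the different framing is visible:

```php
<?php
$plain = "hello world!";

$raw  = gzdeflate($plain);   // RFC 1951: bare DEFLATE data, no header or trailer
$zlib = gzcompress($plain);  // RFC 1950: 2-byte zlib header, Adler-32 trailer
$gzip = gzencode($plain);    // RFC 1952: gzip header, CRC-32 and length trailer

printf("raw:  %s\n", bin2hex(substr($raw, 0, 2)));
printf("zlib: %s\n", bin2hex(substr($zlib, 0, 2)));
printf("gzip: %s\n", bin2hex(substr($gzip, 0, 2)));

// Only the gzip framing starts with the magic bytes 1f 8b.
var_dump(substr($gzip, 0, 2) === "\x1f\x8b");

// gzencode() writes a 10-byte header and an 8-byte trailer around a raw
// DEFLATE payload, so stripping both leaves data that gzinflate() can read.
var_dump(gzinflate(substr($gzip, 10, -8)) === $plain);
```

    Tools that expect a .gz file look for the 1f 8b header, which is why DEFLATE-compressed data alone is not enough here.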

    Well we can shim in a call to the system's gzip binary with proc_open() and stream the data through that to create a properly-formatted gzip stream.

    class GzipCommandFilter extends php_user_filter {
    
        public $stream;
        private $ph, $pipes;
    
        public function onCreate(): bool {
    
            $this->ph = proc_open(
                [ 'gzip', '-c', '-'],
                [
                    ['pipe', 'r'],
                    ['pipe', 'w'],
                    ['pipe', 'w']
                ],
                $this->pipes
            );
    
            if( $this->ph === false ) {
                return false;
            }
    
            stream_set_blocking($this->pipes[1], false);
            stream_set_blocking($this->pipes[2], false);
    
            return true;
        }
    
        public function filter($in, $out, &$consumed, $closing): int {
            $written = 0;
    
            while ($bucket = stream_bucket_make_writeable($in)) {
                fwrite($this->pipes[0], $bucket->data);
                $consumed += $bucket->datalen;
    
                $out_buf = stream_get_contents($this->pipes[1]);
                $written += strlen($out_buf);
                $bucket->data = $out_buf;
                stream_bucket_append($out, $bucket);
            }
    
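            if( $closing ) {
                if( is_resource($this->pipes[0]) ) {
                    fclose($this->pipes[0]); // closing stdin to signal completion
                }
                // Read in blocking mode until gzip closes stdout; waiting for the
                // process to exit first could stall once the pipe buffer fills up.
                stream_set_blocking($this->pipes[1], true);
                $rest = stream_get_contents($this->pipes[1]);
                $this->waitOnProc(); // gzip should exit right after closing stdout
                stream_bucket_append($out, stream_bucket_new($this->stream, $rest));
                return PSFS_PASS_ON;
            } elseif( $written > 0 ) {
                return PSFS_PASS_ON;
            } else {
                return PSFS_FEED_ME;
            }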
        }
    
        protected function waitOnProc($step=1000, $max=1000000) {
            $waited = 0;
            while( ($status = proc_get_status($this->ph))['running'] === true ) {
                usleep($step);
                $waited += $step;
                if( $waited >= $max ) {
                    throw new \Exception('Timed out while waiting.');
                }
            }
        }
    }
    
    stream_filter_register('gzip', 'GzipCommandFilter');
    

    and we would use it like:

    $fh = fopen('/tmp/file.txt', 'rb');
    stream_filter_append($fh, 'gzip');
    $data = stream_get_contents($fh);
    printf("Data: %s\nDecoded: %s\n", bin2hex($data), gzdecode($data));
    

    Which might output something like:

    Data: 1f8b0800000000000003cb48cdc9c95728cf2fca495104006dc2b4030c000000
    Decoded: hello world!
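
    If spawning an external gzip process is not an option, the same idea can probably be expressed in pure PHP with the incremental zlib API (deflate_init() and deflate_add(), available since PHP 7.0). The following is only a sketch under that assumption, not a tested drop-in replacement: the class name GzipIncrementalFilter and the filter name gzip_incremental are made up for illustration.

```php
<?php
// Hypothetical alternative to the proc_open() filter: compress chunk by chunk
// with PHP's incremental zlib context instead of an external process.
class GzipIncrementalFilter extends php_user_filter
{
    public $stream;
    private $ctx;
    private $finished = false;

    public function onCreate(): bool
    {
        // ZLIB_ENCODING_GZIP makes zlib emit the gzip header and trailer.
        $this->ctx = deflate_init(ZLIB_ENCODING_GZIP);
        return $this->ctx !== false;
    }

    public function filter($in, $out, &$consumed, $closing): int
    {
        $status = PSFS_FEED_ME;

        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            // ZLIB_NO_FLUSH lets zlib decide when it has enough output to emit.
            $compressed = deflate_add($this->ctx, $bucket->data, ZLIB_NO_FLUSH);
            if ($compressed !== false && $compressed !== '') {
                $bucket->data = $compressed;
                stream_bucket_append($out, $bucket);
                $status = PSFS_PASS_ON;
            }
        }

        if ($closing && !$this->finished) {
            // ZLIB_FINISH flushes buffered data and writes the gzip trailer.
            $tail = deflate_add($this->ctx, '', ZLIB_FINISH);
            $this->finished = true;
            stream_bucket_append($out, stream_bucket_new($this->stream, $tail));
            $status = PSFS_PASS_ON;
        }

        return $status;
    }
}

stream_filter_register('gzip_incremental', 'GzipIncrementalFilter');
```

    It would be attached the same way as the filter above, e.g. stream_filter_append($fh, 'gzip_incremental', STREAM_FILTER_READ), and only ever holds one chunk plus zlib's internal buffers in memory.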