Tags: php, asynchronous, io, child-process

Stream run-time generated gzip file with proc_open


I'm trying to stream a tar.gz file without buffering anything in memory or saving data to disk. I need to gzip a bunch of PDF files (~100 kB per file).

Everything seems to work fine when small 10-20 byte text files are sent through the script and the user downloads a readable tar.gz file, but when sending real data (run-time generated PDF files) the script blocks and stops.

Below is a snippet of the code. Why does the script block when writing to stdin after a couple of iterations of the loop? It stops at that point, waiting for something.

Every step is logged to a file, and the message logged just before writing to stdin is the last one that appears.

$proc = proc_open('gzip - -c', [
    0   => ['pipe', 'r'],
    1   => ['pipe', 'w'],
    2   => ['pipe', 'w']
], $pipes);

stream_set_read_buffer($pipes[1], 0);
stream_set_read_buffer($pipes[2], 0);

stream_set_blocking($pipes[1], false);
stream_set_blocking($pipes[2], false);

while(true){
    log_step('file stream');
    // fetching data from database and generating PDF file as tar stream (string)

    log_step('stdin: '.strlen($tar_string));
    fwrite($pipes[0], $tar_string); // <--- in the second iteration the script blocks/stops here!
    log_step('stdin done!');
    
    if($output = stream_get_contents($pipes[1])){
        log_step('output: '.strlen($output));
        echo $output;
    }
}

Output log file

2021-01-26 10:28:29 file stream
2021-01-26 10:28:29 stdin: 116224
2021-01-26 10:28:29 stdin done!
2021-01-26 10:28:29 output: 32768
2021-01-26 10:28:29 file stream
2021-01-26 10:28:29 stdin: 116736

Full code

$proc = proc_open('gzip - -c', [
    0   => ['pipe', 'r'],
    1   => ['pipe', 'w'],
    2   => ['pipe', 'w']
], $pipes);
stream_set_read_buffer($pipes[1], 0);
stream_set_read_buffer($pipes[2], 0);
stream_set_blocking($pipes[1], false);
stream_set_blocking($pipes[2], false);

//  get data from database
while($row = $result->fetch()){
    //  generate PDF

    $filename = $pdf['name'];
    $filesize = strlen($pdf['data']);

    $header = pack(
        'a100a8a8a8a12A12a8a1a100a255',
        $filename,
        sprintf('%6s ',     ''),
        sprintf('%6s ',     ''),
        sprintf('%6s ',     ''),
        sprintf('%11s ',    $filesize),
        sprintf('%11s',     ''),
        sprintf('%8s ',     ' '),
        0,
        '',
        ''
    );
    
    $checksum = 0;
    for($i=0; $i<512; $i++){
        $checksum += ord($header[$i]); // square brackets: curly-brace offsets are removed in PHP 8
    }
    
    $checksum_data = pack(
        'a8',
        sprintf('%6s ',     decoct($checksum))
    );
    
    for($i=0, $j=148; $i<8; $i++, $j++){
        $header[$j] = $checksum_data[$i];
    }
    
    fwrite($pipes[0], $header.$pdf['data'].pack(
        'a'.(512 * ceil($filesize / 512) - $filesize),
        ''
    ));
    
    if($output = stream_get_contents($pipes[1])){
        echo $output;
    }
}

fwrite($pipes[0], pack('a512', ''));
fclose($pipes[0]);

while(true){
    if($output = stream_get_contents($pipes[1])){
        echo $output;
    }
    
    if(!proc_get_status($proc)['running']){
        foreach($pipes as $pipe){
            if(is_resource($pipe)){
                fclose($pipe);
            }
        }
        proc_close($proc);
        
        break;
    }
}

Solution

  • The reason your script doesn’t progress is that it is attempting to write more data into the pipe than the gzip process is able to handle at once. The situation looks roughly like this:

    1. Your script writes 116736 bytes into the pipe.
    2. The gzip process reads part of it from its standard input, compresses it, and writes the compressed data to its standard output.
    3. The PHP process blocks until the gzip process consumes the rest of the data written to the pipe.
    4. The gzip process blocks until the PHP process drains the compressed output filling its standard-output pipe, which never happens because PHP is stuck in fwrite.

    And so your script finds itself in a deadlock.

    The root of the problem is that, unlike its namesake in C, PHP's fwrite function in blocking mode will keep retrying until the entirety of the buffer has been written to the stream. This can be worked around by enabling non-blocking mode on the standard-input pipe as well and keeping track of how much of the input has actually been written. For example like this:

    $proc = proc_open('gzip -c -', [
        0 => ['pipe', 'r'],
        1 => ['pipe', 'w'],
    ], $pipes);
    
    stream_set_read_buffer($pipes[1], 0);
    
    stream_set_blocking($pipes[0], false);
    stream_set_blocking($pipes[1], false);
    
    $tar_string = '';
    for (;;) {
        if ($tar_string === '') {
            if (/* more input available */)
                $tar_string = /* read more input */;
            else {
                $tar_string = null;
                \fclose($pipes[0]);
            }
        }
    
        if ($tar_string !== null) {
            $written = \fwrite($pipes[0], $tar_string);
            if ($written === false)
                throw new \Exception('write error');
            $tar_string = \substr($tar_string, $written);
        }
    
        /* THIS IS JUST SOME DUMB DEMONSTRATIVE CODE, DO NOT COPY-PASTE */
    
        for (;;) {
            $outbuf = \fread($pipes[1], 69420);
            if ($outbuf === false)
                throw new \Exception('read error');
            if ($outbuf === '')
                break;
            echo $outbuf;
        }
        
        if (\feof($pipes[1]))
            break;
    }
    

    The above will superficially work. The big downside is that it performs extremely poorly: whenever the gzip process is ready neither to read nor to write any data, the script keeps busy-looping uselessly, taking CPU time away from the gzip process, which actually needs it.

    In a saner programming language, you would have access to:

    • calls such as poll or select, which are able to signal when a stream is ready to be read from or written into, and otherwise give up CPU time to other processes which may need it;
    • I/O primitives that can return immediately upon a successful partial read or write, instead of trying to process the entire size of the buffer.

    But this is PHP, so we can’t have nice things. At least not built in.
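    PHP does, however, ship stream_select, which can put the script to sleep until one of the pipes is actually ready. Below is a rough sketch of the same pipeline using it; this is my own illustration, not code from the answer: gzip_via_subprocess is a made-up name, a gzip binary on the PATH is assumed, and in real streaming code you would echo each chunk as it arrives rather than accumulate it into a string.

```php
<?php
// Sketch: drive `gzip -c` without busy-looping by using stream_select
// to block until gzip can accept input or has output ready.
function gzip_via_subprocess(iterable $chunks): string
{
    $proc = proc_open('gzip -c', [
        0 => ['pipe', 'r'],
        1 => ['pipe', 'w'],
    ], $pipes);
    if (!\is_resource($proc))
        throw new \Exception('proc_open failed');

    stream_set_blocking($pipes[0], false);
    stream_set_blocking($pipes[1], false);

    $input = (function () use ($chunks) { yield from $chunks; })();
    $pending = ''; // bytes not yet written to gzip's stdin
    $out = '';

    while (true) {
        // Refill the pending buffer; close stdin once the input runs dry.
        while ($pending === '' && $pipes[0] !== null) {
            if ($input->valid()) {
                $pending = $input->current();
                $input->next();
            } else {
                fclose($pipes[0]);
                $pipes[0] = null; // EOF tells gzip to finish up
            }
        }

        $read   = [$pipes[1]];
        $write  = ($pending !== '') ? [$pipes[0]] : [];
        $except = [];
        // Sleep until at least one pipe is ready instead of spinning.
        if (stream_select($read, $write, $except, null) === false)
            throw new \Exception('stream_select failed');

        if ($write !== []) {
            $written = fwrite($pipes[0], $pending);
            if ($written === false)
                throw new \Exception('write error');
            $pending = substr($pending, $written);
        }

        if ($read !== []) {
            $buf = fread($pipes[1], 65536);
            if ($buf === false)
                throw new \Exception('read error');
            $out .= $buf;
        }

        if (feof($pipes[1]))
            break;
    }

    fclose($pipes[1]);
    proc_close($proc);
    return $out;
}
```

    stream_select modifies the arrays in place, leaving only the streams that are ready, which is why the write branch checks `$write !== []` rather than re-testing the pipe.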

    There is, however, a much better solution for this problem that avoids proc_open entirely, and instead implements gzip compression using the zlib extension, like this:

    $zctx = \deflate_init(ZLIB_ENCODING_GZIP);
    if ($zctx === false)
        throw new \Exception('deflate_init failed');
    
    while (/* more data available */) {
        $input = /* get more data */;
        $data = \deflate_add($zctx, $input, ZLIB_NO_FLUSH);
        if ($data === false)
            throw new \Exception('deflate_add failed');
        echo $data;
    }
    
    $data = \deflate_add($zctx, '', ZLIB_FINISH);
    if ($data === false)
        throw new \Exception('deflate_add failed');
    echo $data;
    
    unset($zctx); // free compressor resources
    

    deflate_init and deflate_add have been available since PHP 7, provided the zlib extension was enabled when PHP was built. Calling a library is preferable to spawning a subprocess (in any language, in fact) because it is much more lightweight: keeping everything in the same process avoids memory and context-switching overhead.
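    Applied to the original task, the whole tar.gz pipeline then fits in ordinary PHP. The sketch below is my own illustration reusing the question's ustar-style header layout, with two fixes the tar format expects: the size field is written in octal, and the archive is terminated by two 512-byte zero blocks. The inline $files array is a stand-in for the database/PDF-generation step.

```php
<?php
// Sketch: stream a run-time generated .tar.gz using only the zlib extension.

// Build one 512-byte ustar-style header followed by the padded file payload.
function tar_entry(string $name, string $data): string
{
    $size = strlen($data);
    $header = pack(
        'a100a8a8a8a12A12a8a1a100a255',
        $name,
        sprintf('%6s ', '644'),          // mode
        sprintf('%6s ', '0'),            // uid
        sprintf('%6s ', '0'),            // gid
        sprintf('%11s ', decoct($size)), // size, in octal
        sprintf('%11s', decoct(time())), // mtime, in octal
        str_repeat(' ', 8),              // checksum field starts as spaces
        0,                               // typeflag: regular file
        '',                              // linkname
        ''                               // padding
    );

    // Checksum: byte sum of the header with the checksum field left blank.
    $checksum = array_sum(array_map('ord', str_split($header)));
    $header = substr_replace(
        $header, pack('a8', sprintf('%6s ', decoct($checksum))), 148, 8
    );

    $pad = (512 - $size % 512) % 512;    // pad payload to a 512-byte boundary
    return $header . $data . str_repeat("\0", $pad);
}

$files = [ // in the real script: rows from the database, rendered as PDFs
    'a.txt' => 'hello',
    'b.txt' => str_repeat('x', 700),
];

$zctx = deflate_init(ZLIB_ENCODING_GZIP);
$out = '';
foreach ($files as $name => $data) {
    $out .= deflate_add($zctx, tar_entry($name, $data), ZLIB_NO_FLUSH);
}
// A tar archive ends with two 512-byte blocks of zeros.
$out .= deflate_add($zctx, str_repeat("\0", 1024), ZLIB_FINISH);
// echo $out; // in the real script, stream each piece to the client as produced
```

    Because deflate_add returns whatever compressed bytes are ready at each call, every piece can be echoed to the client immediately; nothing is buffered beyond zlib's internal window.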