
Getting the content of a page with 'br' encoding and decoding it with PHP cURL


I want to get the content of this page with PHP cURL:

My cURL sample:

function curll($url,$headers=null){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$url);


    if ($headers){

        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    }

    curl_setopt($ch, CURLOPT_ENCODING, '');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0');
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);

    $response = curl_exec($ch);

    $res['headerout'] = curl_getinfo($ch,CURLINFO_HEADER_OUT);
    $res['rescode'] = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($response === false) {
        $res['content'] = $response;
        $res['error'] = array(curl_errno($ch),curl_error($ch));
        return $res;
    }

    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $res['headerin'] = substr($response, 0, $header_size);
    $res['content'] = substr($response, $header_size);

    return $res;

}

response:

array (size=4)
  'headerout' => string 'GET /wallets HTTP/1.1
Host: www.cryptocompare.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: br
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Upgrade-Insecure-Requests: 1

' (length=327)
  'rescode' => string '200' (length=3)
  'content' => boolean false
  'error' => 
    array (size=2)
      0 => int 23
      1 => string 'Unrecognized content encoding type. libcurl understands deflate, gzip content encodings.' (length=88)

The response encoding is br and the response content is false.

I am aware that requesting gzip or deflate encoding would get me some content. However, the content I am after is only served with br encoding.

I read on this page that curl 7.57.0 added support for Brotli compression. I currently have version 7.59.0 installed, but curl fails with an error when it receives content in br encoding.

How can I get the content of a page served with br encoding and decompress it with PHP cURL?


Solution

  • I had the exact same issue: one server was only able to return Brotli, and the curl version bundled with my PHP didn't support Brotli. I had to use a PHP extension: https://github.com/kjdev/php-ext-brotli

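    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'URL');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // CURLOPT_ENCODING is left unset, so curl returns the body still compressed.
    $output_brized = curl_exec($ch);
    
    // brotli_uncompress() is provided by the php-ext-brotli extension.
    $output_ok = brotli_uncompress($output_brized);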
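
    The snippet above decompresses unconditionally. A slightly more defensive variant could look at the Content-Encoding response header first. The following is only a sketch under assumptions: decode_body() and content_encoding_from_headers() are hypothetical helper names, the br branch assumes php-ext-brotli is loaded, and the gzip/deflate branches assume the zlib extension.

    ```php
    <?php
    // Hypothetical helper: find the Content-Encoding value in raw response headers.
    function content_encoding_from_headers(string $rawHeaders): ?string
    {
        // With redirects there may be several header blocks; keep the last match.
        if (preg_match_all('/^Content-Encoding:\s*(\S+)/mi', $rawHeaders, $m)) {
            return strtolower(end($m[1]));
        }
        return null;
    }

    // Hypothetical helper: decode a raw body according to its content coding.
    function decode_body(string $body, ?string $contentEncoding)
    {
        switch (strtolower(trim((string) $contentEncoding))) {
            case 'br':
                // brotli_uncompress() exists only when php-ext-brotli is loaded.
                if (!function_exists('brotli_uncompress')) {
                    throw new RuntimeException('Brotli extension is not loaded');
                }
                return brotli_uncompress($body);
            case 'gzip':
                return gzdecode($body);      // zlib extension
            case 'deflate':
                return gzuncompress($body);  // assumes zlib-wrapped deflate data
            default:
                return $body;                // no coding applied
        }
    }
    ```

    With the question's curll() function, this would be called as decode_body($res['content'], content_encoding_from_headers($res['headerin'])). For that to work, CURLOPT_ENCODING has to be left unset: once it is set, libcurl tries to decode the body itself and fails on a coding it was not built to handle, which is the error shown in the question.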
    

    I checked: with PHP 7.4.9 on Windows and its bundled curl 7.70.0, setting the CURLOPT_ENCODING option to '' (as you did) made the bundled curl send one additional header, accept-encoding: deflate, gzip, which lists the content encodings the bundled curl can decode. If I omitted this option, only two headers were sent: Host: www.google.com and accept: */*.

    Indeed, searching the PHP source code (https://github.com/php/php-src/search?q=CURLOPT_ENCODING) for CURLOPT_ENCODING led to nothing that sets a default value or changes the value on the PHP side. PHP passes the option value to curl unaltered, so what I am observing is the default behavior of my bundled curl version.

    I then discovered that curl has supported Brotli since version 7.57.0 (https://github.com/curl/curl/blob/bf1571eb6ff24a8299da7da84408da31f0094f66/docs/libcurl/symbols-in-versions), released in November 2017 (https://github.com/curl/curl/blob/fd1ce3d4b085e7982975f29904faebf398f66ecd/docs/HISTORY.md), but it must be compiled with the --with-brotli flag (https://github.com/curl/curl/blob/9325ab2cf98ceca3cf3985313587c94dc1325c81/configure.ac), which was probably not the case for the curl bundled with my PHP version.

    Unfortunately, there is no curl_getopt() function to read back the value of an option. However, phpinfo() is informative: I got a BROTLI => No line, which confirms my version was not compiled with Brotli support. Check your phpinfo() output to find out whether your bundled curl supports Brotli. If it doesn't, use my solution. If it does, more investigation is needed to find out whether it's a bug or a misuse.
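
    The same check can probably be done from code instead of reading the phpinfo() page. A minimal sketch, assuming PHP 7.3 or later (where the CURL_VERSION_BROTLI constant is documented); curl_has_brotli() is a hypothetical helper name, and the 1 << 23 fallback is the value of that flag in libcurl's curl.h:

    ```php
    <?php
    // Hypothetical helper: test the Brotli bit in the curl_version() feature mask.
    function curl_has_brotli(int $features): bool
    {
        // Fall back to the raw flag value when the PHP constant is not defined.
        $flag = defined('CURL_VERSION_BROTLI') ? CURL_VERSION_BROTLI : (1 << 23);
        return ($features & $flag) !== 0;
    }

    if (function_exists('curl_version')) {
        $info = curl_version();
        echo 'libcurl ', $info['version'], ': Brotli ',
            curl_has_brotli($info['features']) ? 'yes' : 'no', PHP_EOL;
    }
    ```

    If this reports "no", the bundled curl cannot decode br on its own, whatever its version number says.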

    If you want to know exactly what curl sent, use a proxy like Charles or Fiddler, or enable curl's verbose mode.

    Additionally, for the sake of completeness, the HTTP/1.1 spec (https://www.rfc-editor.org/rfc/rfc2616#page-102) says:

       If an Accept-Encoding field is present in a request, and if the
       server cannot send a response which is acceptable according to the
       Accept-Encoding header, then the server SHOULD send an error response
       with the 406 (Not Acceptable) status code.
    
       If no Accept-Encoding field is present in a request, the server MAY
       assume that the client will accept any content coding.
    

    So, if your PHP version behaved the same as mine, the website would have received an Accept-Encoding header not containing br. It should therefore NOT have replied with br content. Instead, it should have replied with gzip or deflate content or, if it could not do so, with a 406 Not Acceptable instead of a 200.
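
    To make that reasoning concrete, here is a rough sketch of the check a client could run on its own traffic: given the Accept-Encoding value it sent and the Content-Encoding it received, was the reply acceptable? coding_is_acceptable() is a hypothetical helper and only a simplified reading of the quoted rule, not a full implementation of the spec.

    ```php
    <?php
    // Hypothetical helper: was $coding allowed by the Accept-Encoding value sent?
    function coding_is_acceptable(?string $acceptEncoding, string $coding): bool
    {
        if ($acceptEncoding === null) {
            return true; // no Accept-Encoding sent: the server may assume any coding
        }
        $coding = strtolower(trim($coding));
        $wildcard = null;
        foreach (explode(',', $acceptEncoding) as $item) {
            $parts = array_map('trim', explode(';', strtolower($item)));
            $name = $parts[0];
            $q = 1.0;
            foreach (array_slice($parts, 1) as $param) {
                if (strpos($param, 'q=') === 0) {
                    $q = (float) substr($param, 2);
                }
            }
            if ($name === $coding) {
                return $q > 0;      // listed explicitly; q=0 means "not acceptable"
            }
            if ($name === '*') {
                $wildcard = $q > 0; // remember the wildcard for unlisted codings
            }
        }
        // Unlisted coding: follow the wildcard if any; otherwise only "identity" passes.
        return $wildcard ?? ($coding === 'identity');
    }
    ```

    With the headers observed in this answer, coding_is_acceptable('deflate, gzip', 'br') is false, which is the situation where the spec says a 406 would have been the appropriate reply.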