
How to determine if a string was compressed?


How can I determine whether a string was compressed with gzcompress (apart from comparing the size of the string before and after calling gzuncompress, or would that be the proper way of doing it)?


Solution

  • PRE:
    If the string came from an HTTP request, you can look into $http_response_header right after the request to see whether one of the items in the array is a variation of Content-Encoding: gzip. But this is not ideal, because it only tells you what the server claims.
    There is a far better method.
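For completeness, the header check mentioned above could be sketched roughly like this. The function name `response_declares_gzip` is hypothetical (it is not part of the original answer); it takes the array of raw header lines, such as `$http_response_header`, as a parameter so that it can be tried without a real request.

```php
<?php
// Hypothetical helper: reports whether a list of raw response header lines
// (for example $http_response_header) declares a gzip content encoding.
function response_declares_gzip(array $headers): bool
{
    foreach ($headers as $line) {
        // Header names are case-insensitive; the value may list several codings.
        if (preg_match('/^Content-Encoding\s*:\s*(.*)$/i', $line, $m)) {
            $codings = array_map('trim', explode(',', strtolower($m[1])));
            if (in_array('gzip', $codings, true) || in_array('x-gzip', $codings, true)) {
                return true;
            }
        }
    }
    return false;
}
```

After a `file_get_contents()` call on an http:// URL, this would be used as `response_declares_gzip($http_response_header)`.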


    Here is HOW TO...

    Check if it's GZIP. Like a BOSS!

    According to the GZIP RFC (RFC 1952), the header of gzip content looks like this:

    +---+---+---+---+---+---+---+---+---+---+
    |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
    +---+---+---+---+---+---+---+---+---+---+
    

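    ID1 and ID2 identify the content as GZIP, and CM states the compression method: 8 means deflate, which is what web servers customarily use for gzip. Note that this is the gzip format (what gzencode produces). gzcompress produces the zlib format (RFC 1950), which starts with a different, two-byte header, so the check below will not recognise gzcompress output.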

    Oh, and they have fixed values:

    • The value of ID1 is "\x1f"
    • The value of ID2 is "\x8b"
    • The value of CM is "\x08" (or just 8...)

    Almost there:

    `$is_gzip = 0 === mb_strpos($mystery_string, "\x1f" . "\x8b" . "\x08", 0, "US-ASCII");`
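The same check can also be written with plain byte functions, which avoids any dependence on character encodings. The sketch below is an illustration only: `GZIP_MAGIC`, `looks_like_gzip` and `read_gzip_header` are hypothetical names, and only the byte values and the field layout come from the header described above.

```php
<?php
// ID1, ID2 and CM (8 = deflate), as listed above.
const GZIP_MAGIC = "\x1f\x8b\x08";

// Hypothetical helper: true when the data starts with the three gzip bytes.
function looks_like_gzip(string $data): bool
{
    // strncmp compares raw bytes, so no character encoding is involved.
    return strncmp($data, GZIP_MAGIC, 3) === 0;
}

// Hypothetical helper: returns the ten fixed header fields, or null when the
// data is too short or does not start with the gzip bytes.
function read_gzip_header(string $data): ?array
{
    if (strlen($data) < 10 || !looks_like_gzip($data)) {
        return null;
    }
    // C = unsigned byte, V = unsigned 32-bit little-endian integer (MTIME).
    return unpack('Cid1/Cid2/Ccm/Cflg/Vmtime/Cxfl/Cos', $data);
}
```

For a quick try, `looks_like_gzip(gzencode('hello'))` should be true and `looks_like_gzip('hello')` false. A match on three bytes is a strong hint, not proof, so the decode step can still fail on data that merely happens to start with these bytes.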

    Working example

    <?php
    /** @link https://gist.github.com/eladkarako/d8f3addf4e3be92bae96#file-checking_gzip_like_a_boss-php */
    
    date_default_timezone_set("Asia/Jerusalem");
    
    while (ob_get_level() > 0) ob_end_flush();
    mb_language("uni");
    @mb_internal_encoding('UTF-8');
    setlocale(LC_ALL, 'en_US.UTF-8');
    
    header('Time-Zone: Asia/Jerusalem');
    header('Charset: UTF-8');
    // No Content-Encoding header here: it names a compression coding (e.g. gzip), not a charset.
    header('Content-Type: text/plain; charset=UTF-8');
    header('Access-Control-Allow-Origin: *');
    
    function get($url, $cookie = '') {
      $html = @file_get_contents($url, false, stream_context_create([
        'http' => [
          'method' => "GET",
          'header' => implode("\r\n", [''
            , 'Pragma: no-cache'
            , 'Cache-Control: no-cache'
            , 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2310.0 Safari/537.36'
            , 'DNT: 1'
            , 'Accept-Language: en-US,en;q=0.8'
            , 'Accept: text/plain'
            , 'X-Forwarded-For: ' . implode(', ', array_unique(array_filter(array_map(function ($item) { return filter_input(INPUT_SERVER, $item, FILTER_SANITIZE_SPECIAL_CHARS); }, ['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'HTTP_CLIENT_IP', 'SERVER_ADDR', 'REMOTE_ADDR']), function ($item) { return null !== $item; })))
            , 'Referer: http://eladkarako.com'
            , 'Connection: close'
            , 'Cookie: ' . $cookie
            , 'Accept-Encoding: gzip'
          ])
        ]]));
    
      $is_gzip = 0 === mb_strpos($html, "\x1f" . "\x8b" . "\x08", 0, "US-ASCII");
    
      // zlib_decode() detects the gzip wrapper itself; its second argument is a maximum output length, not an encoding.
      return $is_gzip ? zlib_decode($html) : $html;
    }
    
    $html = get('http://www.pogdesign.co.uk/cat/');
    
    echo $html;
    

    What do we see here that is worth mentioning?

    • The script starts by setting the PHP engine up for UTF-8 output. This is independent of the gzip detection, which works on raw bytes.
    • Sending the header Accept-Encoding: gzip tells the web server that it may respond with gzip content.
    • Detecting gzip content means comparing the first three bytes of the body. The comparison has to be byte-wise, which is why the multi-byte function is given a single-byte encoding (US-ASCII).
    • Finally, returning the plain output is easy using the zlib functions.
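Since the question asks about gzcompress specifically, here is a minimal sketch for that case, under the assumption that the string is either plain or zlib-wrapped (RFC 1950), which is what gzcompress emits. The function name `try_gzuncompress` is hypothetical. It does a cheap plausibility test on the two-byte zlib header and then lets gzuncompress decide.

```php
<?php
// Hypothetical helper: returns the uncompressed data when $data looks like
// gzcompress() output and decodes cleanly, or null otherwise.
function try_gzuncompress(string $data): ?string
{
    if (strlen($data) < 2) {
        return null;
    }
    $cmf = ord($data[0]);
    $flg = ord($data[1]);
    // RFC 1950: the low nibble of CMF is the method (8 = deflate), the high
    // nibble is at most 7, and the two bytes read as a big-endian number
    // must be a multiple of 31.
    $plausible = ($cmf & 0x0f) === 8
        && ($cmf >> 4) <= 7
        && (($cmf << 8) | $flg) % 31 === 0;
    if (!$plausible) {
        return null;
    }
    // The header test can give false positives, so zlib has the final say.
    $out = @gzuncompress($data);
    return $out === false ? null : $out;
}
```

With this, `try_gzuncompress(gzcompress('hello world'))` gives back the original string, while plain text or gzip data yields null. Attempting the decode is more reliable than comparing sizes before and after, since short inputs can grow when compressed.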