
PHP Simple HTML DOM Parser returns gibberish


$html = file_get_html('http://www.livelifedrive.com/');  
echo $html->plaintext;

I've no problem scraping other websites but this particular one returns gibberish.
Is it encrypted or something?


Solution

  • Actually, the gibberish you see is gzip-compressed content.

    When I fetch the page with hurl.it, for instance, the server returns these headers:

    GET http://www.livelifedrive.com/malaysia/ (the URL http://www.livelifedrive.com/ resolves to http://www.livelifedrive.com/malaysia/)
    
    Connection: keep-alive
    Content-Encoding: gzip  <--- The content is gzipped
    Content-Length: 18202
    Content-Type: text/html; charset=UTF-8
    Date: Tue, 31 Dec 2013 10:35:42 GMT
    P3p: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
    Server: nginx/1.4.2
    Vary: Accept-Encoding,User-Agent
    X-Powered-By: PHP/5.2.17
    

    So once you have fetched the content, decompress it before parsing. Here is some sample code:

    if ( ! function_exists('gzdecode'))
    {
        /**
         * Decode gzip-encoded data (fallback for PHP versions before 5.4,
         * which have no built-in gzdecode())
         * 
         * http://php.net/manual/en/function.gzdecode.php
         * 
         * Alternative: http://digitalpbk.com/php/file_get_contents-garbled-gzip-encoding-website-scraping
         * 
         * @param string $data gzencoded data
         * @return string inflated data
         */
        function gzdecode($data) 
        {
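            // Strip the fixed 10-byte gzip header and the 8-byte trailer
            // (CRC32 + size), then inflate the raw deflate stream. This assumes
            // the header has no optional fields (FEXTRA, FNAME, FCOMMENT, FHCRC).
            return gzinflate(substr($data, 10, -8));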
        }
    }
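
To tie this back to the question, here is a minimal sketch of fetching the raw bytes, decompressing them only if they look like gzip data, and then passing the result to Simple HTML DOM. The helper name maybe_gunzip() is hypothetical and not part of any library. The sketch assumes gzdecode() is available, either built in (PHP 5.4+ with zlib) or through the fallback above.

```php
<?php
/**
 * Return the decompressed body if $body looks like gzip data,
 * otherwise return it unchanged.
 *
 * maybe_gunzip() is a hypothetical helper name used for illustration.
 *
 * @param string $body raw response body
 * @return string body that is safe to hand to an HTML parser
 */
function maybe_gunzip($body)
{
    // Every gzip stream starts with the magic bytes 0x1f 0x8b (RFC 1952).
    if (strlen($body) >= 2 && substr($body, 0, 2) === "\x1f\x8b") {
        $decoded = gzdecode($body);

        // gzdecode() returns false on failure; fall back to the raw body.
        return $decoded === false ? $body : $decoded;
    }

    return $body;
}

// Usage with Simple HTML DOM (assumes simple_html_dom.php is on the include path):
//
// include 'simple_html_dom.php';
// $raw  = file_get_contents('http://www.livelifedrive.com/');
// $html = str_get_html(maybe_gunzip($raw));
// echo $html->plaintext;
```

Checking the magic bytes instead of decompressing unconditionally means the same code path works for sites that send uncompressed responses.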
    

    References: