PHP simple_html_dom not parsing Apple Wikipedia page correctly


I am trying to parse a Wikipedia page. For some reason, the code below works for every Wikipedia page I have tried except the Apple page:

include ('simple_html_dom.php');
$url = "http://en.wikipedia.org/wiki/Apple_Inc.";

$html = file_get_html($url);

strlen() of $html above returns 0 for the Apple page.

Note: the above code works perfectly fine when $url is set to other Wikipedia pages, e.g. Microsoft (http://en.wikipedia.org/wiki/Microsoft) or Diageo (http://en.wikipedia.org/wiki/Diageo).

I want to use file_get_html so that I can get the page into a DOM object and process it further.


Solution

  • Change the MAX_FILE_SIZE constant in simple_html_dom.php to a larger value, e.g.

    define('MAX_FILE_SIZE', 800000);
    

    and you are good to go... :) This is why you got '0' for the Apple page: the page is longer than the limit, so this check in file_get_html returns false (and strlen of false is 0):

    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
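
To see which half of that check is failing before touching the constant, the guard can be reproduced outside the library. The following is only a minimal sketch: describe_load_failure is a hypothetical helper, not part of simple_html_dom, and it only roughly mirrors the condition quoted above (the library uses empty(), which also treats the string "0" as empty).

```php
<?php
// Hypothetical helper, NOT part of simple_html_dom: it roughly mirrors the
// library's size guard so the reason for a false return becomes visible.
function describe_load_failure(string $contents, int $maxFileSize): ?string
{
    if ($contents === '') {
        return 'empty response';
    }
    $length = strlen($contents);
    if ($length > $maxFileSize) {
        return sprintf('document is %d bytes, limit is %d', $length, $maxFileSize);
    }
    // Within the limit: this guard would not reject the document.
    return null;
}
```

Assuming the page body was fetched separately (for example with file_get_contents), passing it to this helper together with the value of MAX_FILE_SIZE returns a message when the guard would reject the page, and null otherwise. That gives a concrete size to pick the new limit from instead of guessing.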