Tags: php, web-scraping, simple-html-dom

Simple HTML DOM not fetching or loading the full HTML file despite staying within the memory limit


I'm running a scraper on localhost and am having trouble scraping a 2.50MB HTML file stored in a website directory on my computer.

Right now I have:

  • 36MB of memory allocated
  • Memory usage of 18.93MB to fetch test.html
  • A test.html file of 2.50MB being scraped (see the verification sketch below)
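
These figures can be double-checked from PHP itself. A minimal sketch, assuming test.html sits next to the script:

// Sketch: print the configured memory limit, the peak memory used so
// far, and the size of the file being scraped.
echo ini_get('memory_limit'), "\n";           // e.g. 36M
echo memory_get_peak_usage(true), " bytes\n"; // peak usage, e.g. ~18.93MB
echo filesize('test.html'), " bytes\n";       // e.g. ~2.50MB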

error_reporting(E_ALL);
ini_set('display_errors', '1');

// Load the Simple HTML DOM v2 RC2 classes and fetch the page
require_once 'simplehtmldom-2rc2/HtmlWeb.php';
use simplehtmldom\HtmlWeb;

$doc = new HtmlWeb();
$html = $doc->load('http://localhost/onetab/test.html');

The test.html file sits exactly at a failure threshold: when I add just one more character to it, my scraper fails to fetch the file.

Given the memory limit and memory usage stated above, how can adding one extra character to test.html cause the ->load() call to fail, leaving $html blank (or null)?

I'm using Simple HTML DOM version 2 RC2.

Using the following lines does not help either:

set_time_limit(0); // 0 means no time limit (a large value such as 5000 also works)
ini_set('max_input_time', 5000);
ini_set('max_execution_time', 5000);
ini_set('max_input_vars', 5000);
ini_set('max_input_nesting_level', 5000);
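
Since the numbers above show the raw file fits comfortably within the memory limit, a useful sanity check (a sketch, not part of the original question) is to read the file with plain PHP; if that succeeds, the cutoff is inside the library rather than in PHP's own limits:

// Sanity check: if PHP itself can read the whole file, the failure is
// not a PHP memory or time limit but a cap inside the library.
$raw = file_get_contents('http://localhost/onetab/test.html');
if ($raw === false) {
    echo "PHP could not read the file at all\n";
} else {
    echo 'PHP read ', strlen($raw), " bytes without trouble\n";
}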



Solution

  • In the Simple HTML DOM version 2 RC2 library there is a constants.php file with settings you can change. The MAX_FILE_SIZE constant defined there has to be raised for larger files.

    To make it accept a 9MB file I set the value to 1024 * 1024 * 9. You could just change the value to whatever number you want, or compute it with an expression like

    $chosenvalue = 1024 * 1024 * 9; // 9MB in bytes (1024 bytes * 1024 = 1MB, times 9)


    These settings are documented in the manual/api/constants.md file that ships with the library. However, because the library is still a release candidate awaiting a final stable release, the bundled offline documentation is not fully fleshed out, so it is worth reading the relevant documentation page online instead.
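
    Putting it together, here is a minimal sketch of the fix in context. One assumption to flag: if your copy of constants.php guards its defines with defined() checks (as manual/api/constants.md suggests), you can define MAX_FILE_SIZE in your own script before including the library instead of editing library files; check constants.php for the exact constant name and namespace your version uses.

    // Sketch, not a verified drop-in: depending on how your copy of the
    // library declares MAX_FILE_SIZE, you may need to edit constants.php
    // directly rather than pre-defining the constant here.
    define('MAX_FILE_SIZE', 1024 * 1024 * 9); // raise the limit to 9MB

    require_once 'simplehtmldom-2rc2/HtmlWeb.php';
    use simplehtmldom\HtmlWeb;

    $doc  = new HtmlWeb();
    $html = $doc->load('http://localhost/onetab/test.html');

    if ($html === null) {
        echo "Still failing: the file may exceed MAX_FILE_SIZE\n";
    }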