Tags: php, screen-scraping, simple-html-dom

PHP Dom Scraping large amount of data


I have to gather data from over 8,000 pages with 25 records per page, which is more than 200,000 records in total. The problem is that the server rejects my requests after a period of time. I used simple_html_dom as the scraping library, though I've heard it is rather slow. This is the sample data:

<table>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data1</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data2</td>
</tr>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data3</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data4</td>
</tr>
</table>

And the php scraping script is:

<?php

$fileName = 'output.csv';

header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
header('Content-Description: File Transfer');
header("Content-type: text/csv");
header("Content-Disposition: attachment; filename={$fileName}");
header("Expires: 0");
header("Pragma: public");

$fh = @fopen('php://output', 'w');


ini_set('max_execution_time', 300000000000);

include("simple_html_dom.php");

for ($i = 1; $i <= 8846; $i++) {

    scrapeThePage('url_to_scrape/?page=' . $i);
    if ($i % 2 == 0)
        sleep(10);

}

function scrapeThePage($page)
{

    global $theData;


    $html = new simple_html_dom();
    $html->load_file($page);

    foreach ($html->find('table tr') as $row) {
        $rowData = array();
        foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
            $rowData[] = $cell->innertext;

        }

        $theData[] = $rowData;
    }
}

foreach (array_filter($theData) as $fields) {
    fputcsv($fh, $fields);
}
fclose($fh);
exit();

?>

As you can see, I have added a 10-second sleep interval in the for loop so I don't stress the server with requests. When it prompts me for the CSV download, I find these lines inside the file:

Warning: file_get_contents(url_to_scrape/?page=8846): failed to open stream: HTTP request failed! HTTP/1.0 500 Internal Server Error Fatal error: Call to a member function find() on a non-object in D:\www\htdocs\ucmr\simple_html_dom.php on line 1113

Page 8846 does exist and is the last page to scrape. The page number in the error above varies; sometimes I receive the error at page 800, for example. Can someone please give me an idea of what I am doing wrong in this situation? Any advice would be helpful.


Solution

  • The fatal error is thrown because $html or $row is not an object; it becomes null. You should always check whether an object was created properly. The method $html->load_file($page) may also return false when loading a page fails.

    Also get familiar with instanceof; it becomes very helpful sometimes.

    Another edit: your code has no data validation AT ALL. There is no place where you check for uninitialized variables, unloaded objects, or methods that failed. You should always include such checks in your code.
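    The checks above can be sketched directly on the question's scrapeThePage() function. This is a minimal illustration, assuming (as suggested above) that load_file() returns false on failure; the retry count and backoff values are made up for the example, not part of the original script:

    ```php
    <?php
    include("simple_html_dom.php");

    // Fetch one page with validation and a simple retry loop.
    // $maxRetries and the backoff duration are illustrative values.
    function scrapeThePage($page, $maxRetries = 3)
    {
        global $theData;

        for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
            $html = new simple_html_dom();

            // Guard against HTTP failures (e.g. the 500 errors from the
            // question) instead of calling find() on a broken object.
            if (@$html->load_file($page) === false) {
                sleep(5 * $attempt);   // back off before retrying
                continue;
            }

            foreach ($html->find('table tr') as $row) {
                $rowData = array();
                foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
                    $rowData[] = $cell->innertext;
                }
                $theData[] = $rowData;
            }

            $html->clear();   // free simple_html_dom's memory between pages
            return true;
        }

        // Give up on this page and keep going, rather than crashing the run.
        error_log("Failed to scrape {$page} after {$maxRetries} attempts");
        return false;
    }
    ```

    With this shape, a single failed page is logged and skipped instead of aborting all 8,846 requests with a fatal error.
    
    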