Simple HTML DOM returning false


I've run into something strange when using Simple HTML DOM to parse a page with certain query strings. The page is the used-car listing on a dealership's website; some query strings parse fine and others do not. The pattern seems to be that whenever there are more vehicles still to be shown (that is, on any page of the pagination except the last), it does not return the HTML content. I've tried viewing the page with JavaScript disabled to see whether the markup differs, but the page seems to behave the same way. The code is below. Any ideas, or better yet solutions, are welcome. Thanks all!

require_once 'simple_html_dom.php';
error_reporting(E_ALL);
$startingURL = 'http://www.buickgmcofmilford.com/VehicleSearchResults?model=&certified=&location=&miles=&maxPrice=&minYear=&maxYear=&bodyType=&search=preowned&trim=&make=&pageNumber=2';
// file_get_html() returns a DOM object on success and false on failure
$getHTML = file_get_html($startingURL);
if ($getHTML !== false) {
    echo '<h1>TRUE</h1>';
    var_dump($getHTML);
}
else {
    echo '<h1>FALSE</h1>';
    var_dump($getHTML);
}

With the URL above, var_dump shows a boolean false. With the following URL, I can parse the data with no issue: http://www.buickgmcofmilford.com/VehicleSearchResults?model=&certified=&location=&miles=&maxPrice=&minYear=&maxYear=&bodyType=&search=preowned&trim=&make=&pageNumber=5

Thanks.


Solution

  • You should not use the default file_get_html function to fetch remote content; it uses file_get_contents to download the page. Sometimes the target website blocks a request based on its user agent or referer. Try downloading the page with PHP cURL first, then parse the result with simple_html_dom.
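
A minimal sketch of that approach, under some assumptions: fetch_page and load_dom are hypothetical helper names (not part of Simple HTML DOM), the user-agent string is a placeholder, and simple_html_dom.php is assumed to be on the include path as in the question. It downloads the page with cURL and hands the body to str_get_html, the library's string-parsing counterpart to file_get_html.

```php
<?php
/**
 * Hypothetical helper (not part of Simple HTML DOM): download a page with cURL.
 * Returns the response body as a string, or false on failure.
 */
function fetch_page($url, $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)')
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }

    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent, // placeholder value, an assumption
        CURLOPT_TIMEOUT        => 30,         // give up after 30 seconds
    ));

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false) {
        // Transport-level failure (DNS, connection, timeout, ...)
        error_log('cURL error: ' . curl_error($ch));
        return false;
    }
    if ($status >= 400) {
        // The server answered, but with an error status (e.g. a blocked request)
        error_log('HTTP status ' . $status . ' for ' . $url);
        return false;
    }

    return $body;
}

/**
 * Hypothetical helper: fetch a URL with cURL and parse it with Simple HTML DOM.
 * Returns the parsed DOM object, or false if the download or the parse failed.
 */
function load_dom($url)
{
    // Loaded here so that fetch_page() can also be used on its own
    require_once 'simple_html_dom.php';

    $body = fetch_page($url);
    if ($body === false) {
        return false;
    }

    // str_get_html() can itself return false, so callers should still check
    return str_get_html($body);
}
```

In the question's script, `$getHTML = load_dom($startingURL);` would take the place of the file_get_html call. The existing true/false check still applies, and the error_log output should help show whether a failure comes from the download or from the parse.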