
Simple HTML DOM always loads the default first page instead of the specified URL


I want to scrape a few web pages using PHP and the Simple HTML DOM parser. For instance, I am trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5

I use this to load the URL:

$html = new simple_html_dom();
$html->load_file($url);

This loads the correct page. Then I find the next-page link; here it will be: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6

Only the page value changes from 5 to 6. The code snippet to get the next link is:

function getNextLink($_htmlTemp)
{
    //Getting the next page link; return an empty string if there is none
    $aNext = $_htmlTemp->find('a.next', 0);
    if ($aNext === null) {
        return '';
    }
    return $aNext->href;
}

The above method returns the correct link, with the page value being 6. But when I try to load this next link, it fetches the default first page, with the page query absent from the URL.

//After the loop we will have details of all the listings on this page -- so get the next page link
$nxtLink = getNextLink($originalHtml);  //Returns string url
if(!empty($nxtLink))
{
    //Yay, we have the next link -- load the next link
    print 'Next Url: '.$nxtLink.'<br>'; //$nxtLink has correct value
    $originalHtml->load_file($nxtLink); //This line fetches default page
}

The whole flow is something like this:

 $html->load_file($url);


//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;
//Main Array
$value = array();
do{
    $listings = $originalHtml->find('div.searchResult');    
    foreach($listings as $item)
    {
        //Some logic here
    }


    //After the loop we will have details of all the listings on this page -- so get the next page link
    $nxtLink = getNextLink($originalHtml);  //Returns string url
    if(!empty($nxtLink))
    {
        //Yay, we have the next link -- load the next link        
        print 'Next Url: '.$nxtLink.'<br>';
        $originalHtml->load_file($nxtLink);
    }
    else
    {
        //No next link -- stop the loop as we have covered all the pages
        $shouldLoop = false;
    }

} while($shouldLoop);

I have tried URL-encoding the whole URL, and then only the query parameters, but got the same result. I also tried creating a new instance of simple_html_dom and then loading the file, with no luck. Please help.


Solution

  • You need to html_entity_decode those links; I can see that they are getting mangled by simple-html-dom. The `&` in the query string comes back as the HTML entity `&amp;`, so the `page` parameter never reaches the server as `page`.

    $url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
    $html = str_get_html(file_get_contents($url));
    
    while($a = $html->find('a.next', 0)){
      $url = html_entity_decode($a->href);
      echo $url . "\n";
      $html = str_get_html(file_get_contents($url));
    }