php curl web-scraping simple-html-dom

Unable to grab content traversing multiple pages


I've written a PHP script to scrape titles and their links from a webpage. The webpage spreads its content across multiple pages. My script below can parse the titles and links from its landing page.

How can I modify my existing script to get data from multiple pages, say up to 10 pages?

This is my attempt so far:

<?php
include "simple_html_dom.php";
$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=2";
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach($dom->find('.question-summary') as $file){
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}
get_content($link);
?>

The site increments its pages like ?page=2, ?page=3, etc.


Solution

  • This is how I got it working (following Nima's suggestion).

    <?php
    include "simple_html_dom.php";

    // Base URL; the page number is appended to it in the loop below.
    $link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

    // Fetch one listing page and print each question's title and link.
    function get_content($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $htmlContent = curl_exec($ch);
        curl_close($ch);
        if ($htmlContent === false) {
            return; // request failed; skip this page
        }
        $dom = new simple_html_dom();
        $dom->load($htmlContent);
        foreach ($dom->find('.question-summary') as $file) {
            $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
            $itemLink = $file->find('.question-hyperlink', 0)->href;
            echo "{$itemTitle},{$itemLink}<br>";
        }
    }

    // Pages 1 through 10 inclusive.
    for ($i = 1; $i <= 10; $i++) {
        get_content($link . $i);
    }
    ?>
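
As a possible refinement, the URL building and the HTTP request can be pulled into small helpers so the page range is explicit and a failed request does not reach the parser. This is only a sketch: `build_page_urls` and `fetch_html` are hypothetical names that do not appear in the original post, and the 15-second timeout is an arbitrary assumption.

```php
<?php
// Hypothetical helper (not from the original post): build the URLs for
// pages $firstPage..$lastPage inclusive by appending the page number.
function build_page_urls(string $baseUrl, int $firstPage, int $lastPage): array
{
    $urls = [];
    for ($page = $firstPage; $page <= $lastPage; $page++) {
        $urls[] = $baseUrl . $page;
    }
    return $urls;
}

// Hypothetical helper: fetch one URL with cURL and return the body,
// or null if the request failed or the status code was not 200.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumed timeout, in seconds
    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($html !== false && $status === 200) ? $html : null;
}
```

With these in place, the loop could iterate over `build_page_urls($link, 1, 10)`, call `fetch_html` on each URL, and pass only non-null results to `$dom->load()`. Adding a short `sleep(1)` between requests is one way to avoid hammering the site.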