Tags: php, csv, curl, web-scraping, simple-html-dom

Script writes partial content to a csv file


I've written a script in PHP to scrape titles and their links from a webpage and write them to a CSV file. As I'm dealing with a paginated site, only the content of the last page ends up in the CSV file; the rest gets overwritten. I tried this with write mode (w). However, when I do the same using append mode (a), I find all the data in the CSV file.

Since appending opens and closes the CSV file multiple times (perhaps because of how I've applied my loops), the script becomes inefficient and time-consuming.

How can I do the same in an efficient manner while still using write mode (w)?

This is what I've written so far:

<?php
include "simple_html_dom.php";
$link = "https://stackoverflow.com/questions/tagged/web-scraping?page="; 

function get_content($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $htmlContent = curl_exec($ch);
        curl_close($ch);
        $dom = new simple_html_dom();
        $dom->load($htmlContent);
        $infile = fopen("itemfile.csv","a");
        foreach($dom->find('.question-summary') as $file){
            $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
            $itemLink = $file->find('.question-hyperlink', 0)->href;
            echo "{$itemTitle},{$itemLink}<br>";
            fputcsv($infile,[$itemTitle,$itemLink]);
        }
        fclose($infile);
    }
for ($i = 1; $i < 10; $i++) {
    get_content($link . $i);
}
?>

Solution

  • If you don't want to open and close the file multiple times, open it once before your for-loop, pass the file handle into the function, and close it after the loop:

    // The file handle is now a parameter, so the function no longer
    // opens or closes the file itself.
    function get_content($url, $infile)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $htmlContent = curl_exec($ch);
        curl_close($ch);
        $dom = new simple_html_dom();
        $dom->load($htmlContent);
        foreach ($dom->find('.question-summary') as $file) {
            $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
            $itemLink = $file->find('.question-hyperlink', 0)->href;
            echo "{$itemTitle},{$itemLink}<br>";
            fputcsv($infile, [$itemTitle, $itemLink]);
        }
    }
    
    // Open the file exactly once, in write mode, before the loop.
    $infile = fopen("itemfile.csv", "w");
    
    for ($i = 1; $i < 10; $i++) {
        get_content($link . $i, $infile);
    }
    
    // Close it exactly once, after all pages have been scraped.
    fclose($infile);
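An alternative sketch, if you'd rather keep the scraping function free of file handles entirely: have it return the rows, collect everything in an array, and write the CSV in a single pass. This keeps the open/write/close sequence in one place. The URL, page count, and selectors below are taken from the question; error handling for failed requests is deliberately minimal.

```php
<?php
// Sketch: collect rows per page, then write the CSV once with mode "w".
// Assumes simple_html_dom.php is available, as in the question.
include "simple_html_dom.php";

$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_rows($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    $rows = [];
    foreach ($dom->find('.question-summary') as $item) {
        $anchor = $item->find('.question-hyperlink', 0);
        $rows[] = [$anchor->innertext, $anchor->href];
    }
    return $rows;
}

// Scrape all pages first, accumulating rows in memory.
$allRows = [];
for ($i = 1; $i < 10; $i++) {
    $allRows = array_merge($allRows, get_rows($link . $i));
}

// Single open/write/close: the file is touched exactly once.
$fh = fopen("itemfile.csv", "w");
foreach ($allRows as $row) {
    fputcsv($fh, $row);
}
fclose($fh);
?>
```

The trade-off is memory: all rows live in `$allRows` until the end, which is fine for a few hundred question titles but worth reconsidering for very large scrapes.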