Tags: php, web-scraping, simple-html-dom

Can't parse the titles of some links using function


I've written a script to parse the title of each page using the links collected from this url. To be clearer: my script below is supposed to parse all the links from the landing page and then reuse those links to go one layer deep and parse the titles of the posts there.

As this is my first ever attempt to write anything in PHP, I can't figure out where I'm going wrong.

This is my attempt so far:

<?php
include("simple_html_dom.php");
$baseurl = "https://stackoverflow.com";
function get_links($baseurl)
{
    $weburl = "https://stackoverflow.com/questions/tagged/web-scraping";
    $html   = file_get_html($weburl);
    $processed_links = array();
    foreach ($html->find(".summary h3 a") as $a) {
        $links = $a->href . '<br>';
        $processed_links[] = $baseurl . $links;
    }
    return implode("\n", $processed_links);
}
function reuse_links($processed_links){
    $ihtml = file_get_html($processed_links);
    foreach ($ihtml->find("h1 a") as $item) {
        echo $item->innertext;
    }
}
$pro_links = get_links($baseurl);
reuse_links($pro_links);
?>

When I execute the script, it produces the following error:

Warning: file_get_contents(https://stackoverflow.com/questions/52347029/getting-all-the-image-urls-from-a-given-instagram-user<br> https://stackoverflow.com/questions/52346719/unable-to-print-links-in-another-function<br> https://stackoverflow.com/questions/52346308/bypassing-technical-limitations-of-instagram-bulk-scraping<br> https://stackoverflow.com/questions/52346159/pulling-the-href-from-a-link-when-web-scraping-using-python<br> https://stackoverflow.com/questions/52346062/in-url-is-indicated-as-query-or-parameter-in-an-attempt-to-scrap-data-using<br> https://stackoverflow.com/questions/52345850/not-able-to-print-link-from-beautifulsoup-for-web-scrapping<br> https://stackoverflow.com/questions/52344564/web-scraping-data-that-was-shown-previously<br> https://stackoverflow.com/questions/52344305/trying-to-encode-decode-locations-when-scraping-a-website<br> https://stackoverflow.com/questions/52343297/cant-parse-the-titles-of-some-links-using-function<br> https: in C:\xampp\htdocs\differenttuts\simple_html_dom.php on line 75

Fatal error: Uncaught Error: Call to a member function find() on boolean in C:\xampp\htdocs\differenttuts\testfile.php:18 Stack trace: #0 C:\xampp\htdocs\differenttuts\testfile.php(23): reuse_links('https://stackov...') #1 {main} thrown in C:\xampp\htdocs\differenttuts\testfile.php on line 18

Once again: I expect my script to track the links from the landing page and parse the titles from their target pages.


Solution

  • I'm not very familiar with simple_html_dom, but I'll try to answer the question. This library uses file_get_contents to perform HTTP requests, but in PHP 7 file_get_contents doesn't accept a negative offset (which is this library's default) when retrieving network resources.

    If you're using PHP 7, you'll have to set the offset to 0:

    $html = file_get_html($url, false, null, 0);
    

    In your get_links function you join your links into a string. I think it's best to return an array, since you'll need those links for new HTTP requests in the next function. For the same reason you shouldn't add break tags to the links; you can add the breaks when you print, as in the code below.

    include "simple_html_dom.php";

    function get_links($url)
    {
        $processed_links = array();
        // Derive the base url (scheme + host) from the page url
        $base_url = implode("/", array_slice(explode("/", $url), 0, 3));
        // Offset 0 avoids the PHP 7 negative-offset issue
        $html = file_get_html($url, false, null, 0);
        foreach ($html->find(".summary h3 a") as $a) {
            $link = $base_url . $a->href;
            $processed_links[] = $link;
            echo $link . "<br>\n";
        }
        return $processed_links;
    }
    
    function reuse_links($processed_links)
    {
        foreach ($processed_links as $link) {
            // Fetch each question page and print its title
            $ihtml = file_get_html($link, false, null, 0);
            foreach ($ihtml->find("h1 a") as $item) {
                echo $item->innertext . "<br>\n";
            }
        }
    }
    
    $url = "https://stackoverflow.com/questions/tagged/web-scraping";
    $pro_links = get_links($url);
    reuse_links($pro_links);
    

    I think it makes more sense to pass the main url as a parameter to get_links; we can get the base url from it. I've used array functions for the base url, but you could use parse_url, which is the more appropriate function.
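
    For reference, here's a minimal sketch of the parse_url approach; the get_base_url helper name is just for illustration:

    function get_base_url($url)
    {
        // parse_url splits the url into its components (scheme, host, path, ...)
        $parts = parse_url($url);

        // Rebuild just the scheme and host, e.g. "https://stackoverflow.com"
        return $parts['scheme'] . "://" . $parts['host'];
    }

    echo get_base_url("https://stackoverflow.com/questions/tagged/web-scraping");
    // prints: https://stackoverflow.com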