Tags: php, dom, domdocument, getelementsbytagname, getattribute

Scraping Links on Webpage Need to Determine if they contain Img elements


I'm building a custom scraper for a project. I can currently scrape all of the links on a webpage, storing the href and anchor text in a database. However, I am getting stuck when trying to determine whether the anchor element contains an image element.

Here is my code:

foreach($rows as $row) {
    $url = $row['url'];
    $dom = new DOMDocument;
    libxml_use_internal_errors(TRUE); //disable libxml errors
    $dom->loadHTML(file_get_contents($url));

    // Write source page, destination URL and anchor text to the database
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        $img = $link->getElementsByTagName('img');
        $imgalt = $img->getAttribute('alt');

I then write the data to the database. This works fine without $img and $imgalt, but I really want to identify whether the anchor contains an image and, if so, whether it has an alt attribute. I know the problem is how I am trying to select the image using getElementsByTagName. I have been Googling all day and trying lots of different suggestions, but nothing seems to work. Is this even possible?

I have followed the instructions mentioned here.

There is some progress. I can echo the HTML of images within the anchor elements (if I just echo DOMinnerHTML($link)), but I still can't get the alt attribute. I keep getting "Call to a member function getAttribute() on a non-object".

Here is my code now:

foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        $imgdom = DOMinnerHTML($link);
        $imgalt = $imgdom->getAttribute('alt');
        if(isset($imgalt)){
            echo $imgalt;
        }

Solution

  • I can only guess, but I suppose you want something like this:

    <?php
    
    $html_fragment = <<<HTML
    <html>
    <head>
        <title></title>
    </head>
    <body>
    <div id="container">
        <a href="#a">there is no image here</a>
        <a href="#b"><img src="path/to/image-b" alt="b: alt content"></a>
        <a href="#c"><img src="path-to-image-c"></a>
        <a href="#d"><img src="path-to-image-d" alt="d: alt content"></a>
    </div>
    </body>
    </html>
    HTML;
    
    
    $dom = new DOMDocument();
    @$dom->loadHTML($html_fragment);
    $links = $dom->getElementsByTagName('a');
    
    foreach ($links as $link) {
        # link contains image child?
        $imgs    = $link->getElementsByTagName('img');
        $has_img = $imgs->length > 0;
    
        if ($has_img) {
            # img element has a non-empty alt attribute?
            $has_alt = $imgs->item(0)->getAttribute("alt") !== "";
            if ($has_alt) {
                // do something...
            }
        } else {
            // do something...
        }
    }
    

    Remember that, as the PHP documentation says, DOMElement::getAttribute() returns the value of the attribute, or an empty string if no attribute with the given name is found. So an empty return value means the attribute is either missing or empty. If you need to tell those two cases apart, call DOMElement::hasAttribute() first.
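
To tie this back to the loop in the question, here is a minimal sketch of how the per-link data could be collected before writing it to the database. The helper name extractLinks() and the array keys are assumptions made for illustration, not part of the DOM extension. It picks the first img with item(0) instead of calling getAttribute() on the node list, and uses hasAttribute() so that a missing alt (null) can be told apart from an empty one ('').

```php
<?php

/**
 * Collect href, anchor text and image/alt details for every link in an HTML string.
 * extractLinks() is a hypothetical helper, named here only for illustration.
 */
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        // getElementsByTagName() returns a DOMNodeList, so select one node with item()
        $img = $link->getElementsByTagName('img')->item(0); // null when there is no <img>

        $result[] = [
            'href'    => $link->getAttribute('href'),
            'anchor'  => trim($link->textContent),
            'has_img' => $img !== null,
            // null = no alt attribute at all; '' = alt attribute present but empty
            'alt'     => ($img !== null && $img->hasAttribute('alt'))
                ? $img->getAttribute('alt')
                : null,
        ];
    }

    return $result;
}

// Sample markup (made up for this example)
$html = '<html><body>'
    . '<a href="#a">no image here</a>'
    . '<a href="#b"><img src="b.png" alt="b: alt content"></a>'
    . '<a href="#c"><img src="c.png"></a>'
    . '<a href="#d"><img src="d.png" alt=""></a>'
    . '</body></html>';

foreach (extractLinks($html) as $row) {
    var_export($row);
    echo PHP_EOL;
}
```

In the scraper, each returned row could then be passed to whatever insert statement already stores the href and anchor text. Only the first image inside each anchor is inspected here, so links wrapping several images would need an inner loop over the node list.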