php html web-scraping xpath

Capture Blank Nodes DOM and assign value


OK, I'm a bit of a newbie to the DOM, but I've managed to cobble together a semi-working solution until now.

Using XPath, I was looking for key elements within a web page and looping through each instance, which was fine until I reached a node that is empty.

So when building my array, I have, say, 20 nodes of one element but only 14 of another, because the img isn't there all the time.

So in effect I have an array that looks like this:

Array
(
[0] => Array
    (
        [item] => PV10923
        [img] => image1.jpg
    )

[1] => Array
    (
        [item] => PV10924
        [img] => image2.jpg
    )

[2] => Array
    (
        [item] => PV10925
        [img] => image3.jpg
    )

[3] => Array
    (
        [item] => PV10926
        [img] => image4.jpg
    )

[4] => Array
    (
        [item] => PV10927
        [img] => 
    )

[5] => Array
    (
        [item] => PV10928
        [img] => 
    )

[6] => Array
    (
        [item] => PV10929
        [img] => 
    )

)

when in reality it should look like this:

Array
(
[0] => Array
    (
        [item] => PV10923
        [img] => image1.jpg
    )

[1] => Array
    (
        [item] => PV10924
        [img] => image2.jpg
    )

[2] => Array
    (
        [item] => PV10925
        [img] =>  
    )

[3] => Array
    (
        [item] => PV10926
        [img] =>  
    )

[4] => Array
    (
        [item] => PV10927
        [img] => 
    )

[5] => Array
    (
        [item] => PV10928
        [img] => image3.jpg
    )

[6] => Array
    (
        [item] => PV10929
        [img] => image4.jpg
    )

)

Now the webpage source code looks like this

<div id="item">
<h2>PV PV10924</h2>
<p>
<a href="http://www.example.com"><img src="image4.jpg"></a>
</p>
</div>
<div id="item">
<h2>PV PV10925</h2>
<p>
&nbsp; (assign a value)
</p>
</div>
<div id="item">
<h2>PV PV10926</h2>
<p>
<a href="http://www.example.com"><img src="image5.jpg"></a>
</p>
</div>
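
The misalignment described above is what you typically get when two document-wide queries are run separately and their results are paired by index. The question does not show its loop, so the following is only a hypothetical reconstruction under that assumption (the variable names and the shortened markup are illustrative):

```php
<?php
// Hypothetical reconstruction of the pitfall; the original loop is not shown in the question.
$html = '<div id="item"><h2>PV PV10924</h2><p><a href="http://www.example.com"><img src="image4.jpg"></a></p></div>'
      . '<div id="item"><h2>PV PV10925</h2><p>&nbsp;</p></div>'
      . '<div id="item"><h2>PV PV10926</h2><p><a href="http://www.example.com"><img src="image5.jpg"></a></p></div>';

libxml_use_internal_errors(true); // keep parser warnings (e.g. duplicate ids) out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Two independent node lists of different lengths.
$titles = $xpath->query('//div[@id="item"]/h2');   // one per item
$images = $xpath->query('//div[@id="item"]//img'); // only where an image exists

$misaligned = [];
for ($i = 0; $i < $titles->length; $i++) {
    $img = $images->item($i); // null once the shorter list runs out
    $misaligned[] = [
        'item' => trim($titles->item($i)->textContent),
        'img'  => $img instanceof DOMElement ? $img->getAttribute('src') : '',
    ];
}

print_r($misaligned);
```

Here the second item, which has no image, is paired with `image5.jpg`, and the blank ends up on the last item instead. Querying relative to each container div avoids this, because a missing child then stays attached to its own item.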

I've been looking all over to see if there is a way to capture the parent, then use an if statement to check whether the child is present: if it is, run the XPath query; if not, assign a default value to the node.

Being dyslexic, reading is not my forte, but believe me, I'm trying...

Can anyone please advise me on the best route or method to achieve this?


Solution

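  • You could loop over each container element and check whether it has the descendant you expect (here, an <a> wrapping the image), falling back to a default value when it is missing. For example: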

    $sample_markup = '<div id="item"><h2>PV PV10924</h2><p><a href="http://www.example.com"><img src="image4.jpg"></a></p></div><div id="item"><h2>PV PV10925</h2><p>&nbsp; (assign a value)</p></div><div id="item"><h2>PV PV10926</h2><p><a href="http://www.example.com"><img src="image5.jpg"></a> </p> </div>';
    // using the sample markup above
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($sample_markup);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    
    $data = array();
    $elements = $xpath->query('//div[@id="item"]');
    foreach($elements as $e) {
        $item = $xpath->evaluate('string(.//h2/text())', $e);
        // count the elements inside this div that contain an <a>; zero means there is no image link
        $check = $xpath->evaluate('count(.//*[descendant::a])', $e);
        if($check > 0) {
            $image = $xpath->evaluate('string(.//a/img/@src)', $e);
        } else {
            $image = 'test.jpg';
        }
        $data[] = array('item' => $item, 'image' => $image);
    }
    
    echo '<pre>';
    print_r($data);
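
A slightly more compact variant of the same idea is to query the `src` attribute relative to each div and treat an empty string as "no image", since XPath's `string()` of an empty node-set is `''`. This is only a sketch: the helper name `extractItems`, the `placeholder.jpg` default, and the `img` key (taken from the question's arrays) are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper; the name and the fallback value are illustrative.
function extractItems(string $html, string $fallback = 'placeholder.jpg'): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $rows = [];
    foreach ($xpath->query('//div[@id="item"]') as $div) {
        // normalize-space() trims and collapses whitespace in the heading text.
        $item = $xpath->evaluate('normalize-space(.//h2)', $div);
        // Relative query: an item without an <img> yields '' rather than shifting the list.
        $src = $xpath->evaluate('string(.//img/@src)', $div);
        $rows[] = ['item' => $item, 'img' => $src !== '' ? $src : $fallback];
    }
    return $rows;
}

$html = '<div id="item"><h2>PV PV10924</h2><p><a href="http://www.example.com"><img src="image4.jpg"></a></p></div>'
      . '<div id="item"><h2>PV PV10925</h2><p>&nbsp; (assign a value)</p></div>'
      . '<div id="item"><h2>PV PV10926</h2><p><a href="http://www.example.com"><img src="image5.jpg"></a></p></div>';

$rows = extractItems($html);
print_r($rows);
```

Testing for the `<img>` itself rather than for a wrapping `<a>` means an image that is not wrapped in a link is still picked up, and a link without an image still falls back to the default.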
    

    Sample Output