php dom curl getelementsbytagname

DOM structure, get element by attribute name/value


I see a lot of answers on SO that pertain to this question, but either there are slight differences that I couldn't overcome or I just couldn't reproduce the processes shown.

What I am trying to accomplish is to use cURL to fetch the HTML of a Google+ business page, iterate over it, and, for each review of the business, scrape the review's HTML for display on that business's non-Google+ web page.

Every review shares this parent div structure:

<div class="ZWa nAa" guidedhelpid="userreviews"> .....

So I am trying to write a foreach loop that finds each div with the attribute guidedhelpid="userreviews" and grabs its inner HTML.

I am successfully getting the HTML back via cURL and can parse it when targeting a standard tag name like "a", or an element that has an ID, but iterating over the HTML with PHP's built-in parser when looking for an attribute name is problematic:

How can I take the working code below and make it behave as intended in the second snippet, which of course is wrong?

WORKING CODE (finds, gets, and echoes all "a" tags in $output)

$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);


foreach ($DOM->getElementsByTagName('a') as $link) {
    # Show the <a href>
    echo $link->getAttribute('href');
    echo "<br />";
}

THEORETICALLY NEEDED CODE: (Find every review by custom attribute in HTML and echo them)

$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);


# Does not work: getElementsByTagName() expects a plain tag name, not a selector
foreach ($DOM->getElementsByTagName('div[guidehelpid=userreviews]') as $review) {
    echo $review;
    echo "<br />";
}

Any help in correcting this would be appreciated. I would prefer not to use "simple_html_dom" if I can accomplish this without it.


Solution

  • I suggest using DOMXPath in this case. Example:

    $url = "https://plus.google.com/+Mcgowansac/about";
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $output = curl_exec($curl);
    curl_close($curl);
    
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($output);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    
    $review = $xpath->query('//div[@guidedhelpid="userreviews"]');
    
    if($review->length > 0) { // if it exists
        echo $review->item(0)->nodeValue;
        // echoes
        // John DeRemer reviewed 3 months ago Last fall, we had a major issue with mold which required major ... and so on
    }
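
The example above prints only the text of the first match. Since the question asks for a foreach over every matching div and its inner HTML, the same query can be looped over. This is a minimal sketch, not a tested scraper: the $html string is made-up stand-in markup (an assumption) used in place of the cURL response so the snippet runs on its own, and the $reviews array is a name introduced here for illustration.

```php
<?php
// Stand-in markup (an assumption): two blocks sharing the attribute from the question.
$html = '<html><body>'
      . '<div class="ZWa nAa" guidedhelpid="userreviews"><span>First review</span></div>'
      . '<div class="ZWa nAa" guidedhelpid="userreviews"><span>Second review</span></div>'
      . '</body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$reviews = [];
foreach ($xpath->query('//div[@guidedhelpid="userreviews"]') as $div) {
    // Build the "inner HTML" by serializing each child node of the div.
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    $reviews[] = $inner;
}

foreach ($reviews as $inner) {
    echo $inner, "<br />\n";
}
```

With the real page, $html would be replaced by the $output returned from curl_exec(). Passing a node to saveHTML() requires PHP 5.3.6 or later; nodeValue remains the simpler choice when only the text is needed.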