Search code examples
phpscreen-scraping

Scraping Google Front Page Results with php


i can with php code Scraping title and url from google search results now how to get descriptions

$url  = 'http://www.google.com/search?hl=en&safe=active&tbo=d&site=&source=hp&q=Beautiful+Bangladesh&oq=Beautiful+Bangladesh';
$html = file_get_html($url);

$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
    $title = trim($linkObj->plaintext);
    $link  = trim($linkObj->href);

    // if it is not a direct link but url reference found inside it, then extract
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;
    }

    echo '<p>Title: ' . $title . '<br />';
    echo 'Link: ' . $link . '</p>';
}

The above code gives the following output

Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/

Now I want the following output

Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/
description : photo.com.bd is a website for creative photographers from Bangladesh, mainly for amateur ... Natural-Beauty-of-Bangladesh_Flower · fishing on ... BEAUTY-4.

Solution

  • include("simple_html_dom.php");
    
    $in = "Beautiful Bangladesh";
    $in = str_replace(' ','+',$in); // space is a +
    $url  = 'http://www.google.com/search?hl=en&tbo=d&site=&source=hp&q='.$in.'&oq='.$in.'';
    
    print $url."<br>";
    
    $html = file_get_html($url);
    
    $i=0;
    $linkObjs = $html->find('h3.r a'); 
    foreach ($linkObjs as $linkObj) {
        $title = trim($linkObj->plaintext);
        $link  = trim($linkObj->href);
    
        // if it is not a direct link but url reference found inside it, then extract
        if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
            $link = $matches[1];
        } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
            continue;
        }
    
        $descr = $html->find('span.st',$i); // description is not a child element of H3 thereforce we use a counter and recheck.
        $i++;   
        echo '<p>Title: ' . $title . '<br />';
        echo 'Link: ' . $link . '<br />';
        echo 'Description: ' . $descr . '</p>';
    }