How can I use PHP (simple_html_dom) to search Google Patents?


I would like to get search results from Google Patents. Can anyone help?

This is an example that works for Google Search:

<?php
require_once('simple_html_dom.php');
$url  = 'https://www.google.com/search?hl=en&q=facebook&num=1';
$html = file_get_html($url);
$linkObjs = $html->find('h3.r a');

foreach ($linkObjs as $linkObj) {
    $title = trim($linkObj->plaintext);
    $link  = trim($linkObj->href);

    // if it is not a direct link but url reference found inside it, then extract
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;    
    }

    echo '<p>Title: ' . $title . '<br />';
    echo 'Link: ' . $link . '</p>';    
}

?>

Result:

Title: Welcome to Facebook - Log In, Sign Up or Learn More
Link: https://www.facebook.com/

I like this result, but I need to search Google Patents instead.

If there are better options or methods, please let me know. I would be very grateful.


Solution

  • If you are looking for a patent on "multifunctional keypad", set $url to "https://www.google.com/search?tbm=pts&hl=en&q=multi+function+keypad&num=1".

    Keep in mind that if you search for a patent that is not available on that site, you might get a result from some other site, or no result at all. You will need to handle these situations (e.g. check whether the result contains www.google.com/patents/).
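
    The check mentioned above could be sketched roughly as follows. This is only a minimal illustration: the function name isGooglePatentUrl and the sample links are made up for this example, and it assumes patent results live under https://www.google.com/patents/ as in the URLs shown in this answer.

    ```php
    <?php

    // Hypothetical helper (not part of any library): returns true only for
    // absolute links that point at a page under www.google.com/patents/.
    function isGooglePatentUrl($link)
    {
        $parts = parse_url($link);
        // Relative links have no host; malformed links make parse_url() return false.
        if ($parts === false || !isset($parts['host'], $parts['path'])) {
            return false;
        }
        return strtolower($parts['host']) === 'www.google.com'
            && strpos($parts['path'], '/patents/') === 0;
    }

    // Illustrative input only; real links would come from the search result page.
    $links = array(
        'https://www.google.com/patents/US7724240?dq=keypad',
        'https://www.example.com/some-article',
        '/search?q=keypad',
    );

    // Keep only the links that pass the check, reindexed from 0.
    $patentLinks = array_values(array_filter($links, 'isGooglePatentUrl'));
    print_r($patentLinks);
    ```

    Comparing the parsed host and path is a little stricter than a plain substring search, so a link that merely mentions www.google.com/patents/ in its query string is not accepted.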

    A much more effective way to search would be to use the Google API. Search for "patent" and "php" on https://developers.google.com/web-search/docs/

    Hope this helps.

    Update: I wrote a little script to show that what I described can work. I didn't want to learn simple_html_dom.php, so I didn't use it. You can probably work out how to improve my code with simple_html_dom.php.

    Sometimes it needs a couple of refreshes to work. My code picks a random IP, and when Google doesn't treat that IP as valid it returns no result. Feel free to use your own IP, but it might soon get blocked if you run this too frequently. Randomizing the IP may still not prevent your IP from being blocked if you run it too often (Google asks for a CAPTCHA when it detects scraping-like activity). I also randomize a few other things, such as the HTTP header and the user agent. Here is the code:

    <?php
    
    function searchGooglePatent($searchString){
            $url = "https://www.google.com/search?tbm=pts&hl=en&q=".rawurlencode($searchString);//."&num=1"; // add &num=1 if you need only one result
            echo $url;
            $html = geturl($url);
            $ids = match_all('/<a.*?href=\"(https:\/\/www\.google\.com\/patents\/\w\w\d+)\?.*?\".*?>.*?<\/a>/ms', $html, 1);
            return $ids;
        }
    
    function match_all($regex, $str, $i = 0){
            if(preg_match_all($regex, $str, $matches) === false) {
                return false;
            } else {
                return $matches[$i];
            }
        }
    
    
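    function geturl($url){
            $ch = curl_init();
            // Peer verification is switched off for convenience; this is insecure.
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
            // Random IP for the forwarding header; the real source address of the request does not change.
            $ip=rand(0,255).'.'.rand(0,255).'.'.rand(0,255).'.'.rand(0,255);
            echo "<br>".$ip."<br>";
            // REMOTE_ADDR and HTTP_X_FORWARDED_FOR are PHP $_SERVER keys, not header names; the request header is X-Forwarded-For.
            curl_setopt($ch, CURLOPT_HTTPHEADER, array("X-Forwarded-For: $ip"));
            // Randomized user agent string.
            curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/".rand(3,5).".".rand(0,3)." (Windows NT ".rand(3,5).".".rand(0,2)."; rv:2.0.1) Gecko/20100101 Firefox/".rand(3,5).".0.1");
            set_time_limit(90);
            // curl_exec() returns the page body, or false on failure.
            $html = curl_exec($ch);
            curl_close($ch);
            return $html;
        }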
    
    $searchResult = searchGooglePatent("Multi function keypad");
    echo "<pre>";
    var_dump($searchResult);
    echo "</pre>";
    
    ?>
    

    The result page would look like this:

        https://www.google.com/search?tbm=pts&hl=en&q=Multi%20function%20keypad
        71.10.79.131
        array (size=4)
          0 => string 'https://www.google.com/patents/US7724240' (length=40)
          1 => string 'https://www.google.com/patents/US6876312' (length=40)
          2 => string 'https://www.google.com/patents/US8259073' (length=40)
          3 => string 'https://www.google.com/patents/US7523862' (length=40)
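
    As a possible alternative to the regular expression used in the script, the links could also be collected with PHP's built-in DOM extension. This is a sketch under assumptions, not a drop-in replacement: it uses DOMDocument instead of simple_html_dom.php, the function name extractPatentLinks and the sample markup are invented for illustration, and the real result page layout may differ. Unlike the regex in the script, it does not require a "?" after the patent ID.

    ```php
    <?php

    // Hypothetical helper: collects unique Google Patents links from an HTML string.
    function extractPatentLinks($html)
    {
        // geturl() may return false, and loadHTML() rejects an empty string.
        if (!is_string($html) || $html === '') {
            return array();
        }

        $doc = new DOMDocument();
        // Real-world pages are rarely valid HTML, so collect parser warnings silently.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $links = array();
        foreach ($doc->getElementsByTagName('a') as $anchor) {
            $href = $anchor->getAttribute('href');
            // Same idea as the original pattern: two word characters, then digits.
            if (preg_match('#^https://www\.google\.com/patents/\w\w\d+#', $href, $m)) {
                $links[$m[0]] = true; // array keys drop duplicates
            }
        }
        return array_keys($links);
    }

    // Made-up markup for illustration only.
    $sample = '<html><body>'
        . '<a href="https://www.google.com/patents/US7724240?dq=keypad">Keypad</a>'
        . '<a href="https://www.google.com/patents/US7724240?cl=en">Duplicate</a>'
        . '<a href="https://www.example.com/">Other site</a>'
        . '</body></html>';

    print_r(extractPatentLinks($sample));
    ```

    In the script, the call would be extractPatentLinks($html) in place of the match_all(...) line inside searchGooglePatent().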