I'm trying to scrape some data from Google Patents, and the beginning of my code looks like this:
In [1]: import urllib2
In [2]: url='http://www.google.com/search?tbo=p&q=ininventor:\"{}\"&hl=en&tbm=pts&source=lnt&tbs=ptso:us'.format('John-Mudd')
In [3]: print url
In [4]: page=urllib2.urlopen(url)
This raises the following error:
C:\Python27\lib\urllib2.pyc in urlopen(url, data, timeout)
124 if _opener is None:
125 _opener = build_opener()
--> 126 return _opener.open(url, data, timeout)
128 def install_opener(opener):
C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
404 for processor in self.process_response.get(protocol, []):
405 meth = getattr(processor, meth_name)
--> 406 response = meth(req, response)
408 return response
C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
517 if not (200 <= code < 300):
518 response = self.parent.error(
--> 519 'http', request, response, code, msg, hdrs)
521 return response
C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
442 if http_err:
443 args = (dict, 'default', 'http_error_default') + orig_args
--> 444 return self._call_chain(*args)
446 # XXX probably also want an abstract factory that knows when it makes
C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
376 func = getattr(handler, meth_name)
--> 378 result = func(*args)
379 if result is not None:
380 return result
C:\Python27\lib\urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
525 class HTTPDefaultErrorHandler(BaseHandler):
526 def http_error_default(self, req, fp, code, msg, hdrs):
--> 527 raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
529 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
Not sure why I'm getting this.
Trying it with urllib.urlopen instead gets me a little further:
In [1]: from bs4 import BeautifulSoup
In [2]: import urllib
In [3]: url='https://www.google.com/search?tbo=p&q=ininventor:"Alan-Mudd"&hl=en&tbm=pts&source=lnt&tbs=ptso:us'
In [4]: print url
In [5]: page=urllib.urlopen(url)
In [6]: txt=BeautifulSoup(page).get_text()
In [7]: txt
Out[7]: u'htmlError 403 (Forbidden)!!1*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}403. That\u2019s an error.Your client does not have permission to get URL /search?tbo=p&q=ininventor:%22John-Mudd%22&hl=en&tbm=pts&source=lnt&tbs=ptso:us from this server. (Client IP address:\nPlease see Google\'s Terms of Service posted at http://www.google.com/terms_of_service.html\nIf you believe that you have received this response in error, please report your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the entire code displayed below. Please also send us any information you may know about how you are performing your Google searches-- for example, "I\'m using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP." or "I\'m using the Konqueror browser on Linux to search from my job at myFoo.com. My machine\'s IP address is, but all of myFoo\'s web traffic goes through some kind of proxy server whose IP address is" (If you don\'t know any information like this, that\'s OK. But this kind of information can help us track down problems, so please tell us what you can.)We will use all this information to diagnose the problem, and we\'ll hopefully have you back up and searching with Google again quickly!\nPlease note that although we read all the email we receive, we are not always able to send a personal response to each and every email. 
So don\'t despair if you don\'t hear back from us!\nAlso note that if you do not send us the entire code below, we will not be able to help you.Best wishes,The Google Team/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/\nDVEH8IymbCoo1dGrTzyT1iwqSLxjtFu0V4uU5kLrZ7OjChn7z\nLh5w5aLlP6v5piIZSC7_8OTKEepHBTROurNwIOVtc7sH0UTJL\n6mOs5-a-s4X63WfAUM064ZFl9JGwBR6VMKvdyMQOoYo7WtAGI\nbcVTsj7H3uWatFa4O9Zuxs7IcRQLOCiWwwhQs-s2AoAiFKlGN\nyXaTCO8GfhXxBt5JYCrTx-mkyrtqDqG_yvNu-fPYTf7V7jLNK\ntgwnPMBejraU_xbwSzoWNx2z7SfDmbPncbwSAMNnZ2CfiMhp8\ns1LQK90rg5hYAgbLmoVjMHMZ0WeRoay-XoB1oKQzff-nnkAEy\nuULx-MidjfVeuQfChSHMY3HUZ13vvzOsJZUjF-GH_-uymoPRG\n5RUBeodyOO3x9cJ-0mvHC_TyAwog10cRwaGKdS-DO92moJem0\nEoUKjmHuF4wXPcbGlSh_GXC9rFM07K6ZR4DxrV27iRaBZmen_\naw_l0qXlfK8quX7qAJT9W2EcrDRDYZdiNnBw7DdpLGeTCK76E\n0KCimiCY1uKC6kkdbGfFjQPK0R-_8DtBE5k7_MwgPR5O-sT0w\nf-ZH0vyEHSor4N8ZCogRMH_mR9L8hB2vrT5HWmYNJbLxS3SjB\nZHeL2vErN6jDFdpTFN7rPKU3-hnP-3zevYMUhHMFSPsi9ShZ7\nddrhqBhbdzifrwC4RgGbsqKTMMUERaoRJC9jj4jrNd14PlOpa\nztAa_82MQ1FhUswXO0EJ6dOHL6NknoBWOYN2-IFT_7cvAbxV6\nofoYL_y5WihMeZpDBPnpRyhjxjAefxNdzA5h9bE5GqV9ZoS92\n4q3Q81-0WK0kmloyf019Y5fI8Ln7ooJFzNpW5Fa7ezHhJ1Yxh\nHLNlD8dLFZogHDrtHsvWOzPWjYESdflsnJ5TjSijnt8ZGF_eT\ncZy50Pt4AFMsVUC3Dn4jkkzv-tok_1WgLKrEqpzzc55Hc4fOq\n9zdSYk52EH0R4__7fJ0w8ZfGmU4x1qGGDatZNpJRpSpLIJjXw\ntJPiGXllFPqQIFfWjIk3WubYKUJOHW37IyIJFjT-yVn6YgESl\nVe2nKpc1FBV1lSyhz5aW-QZtu_tCgPfG4gbfUCPYk54XBgNL4\n6a034xtD5a3rlRw1_ZnBCi7962YybZhX9MKXq5x6Au-y3Fqgg\nxzqicRlQ9UUso0fQ4JJRrLv57OuS2VvpaDCvN8pU1YQOSQWeX\niD1eqxMVoQ35ZaoCYlr-SBaRPuiwett9Fk6EZkvEWL1JqAiQq\n6k_PQ7hoISsoBSYSg1ztYV43JFfZLt1PE_geCPOb7XgUE5rVf\nQPHQX48cKjZmlrzYUyXS_BGSqZOPxZoj7ANivSn5vE88b74wH\ndoybElt_BVmigcY
The error message in its entirety is shown in this image.
It seems Google are blocking some crawlers.
codingatty pointed out that it doesn't work when the user agent string is 'Python'.
Based on my experiments, the following user agent strings do not work either (obviously, this list is incomplete).
Since urllib2's default user agent string is 'Python-urllib/2.7' (on Python 2.7), you need to set the User-Agent header to that of a common web browser or to a fake one.
For example:
import urllib2
url = 'http://www.google.com/search?tbo=p&q=ininventor:"John-Mudd"&hl=en&tbm=pts&source=lnt&tbs=ptso:us'
# Override the default 'Python-urllib/2.7' User-Agent header
req = urllib2.Request(url, headers={'User-Agent': 'foobar'})
page = urllib2.urlopen(req)
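The same idea can be wrapped in a small helper that builds the request without sending it, so the header can be checked first. This is only a sketch under a few assumptions: the names `build_patent_request`, `fetch` and `BROWSER_UA` are made up for illustration, the user agent value is an arbitrary browser-like string, and the import fallback is there so the snippet also runs on Python 3, where `urllib2` was split into `urllib.request` and `urllib.parse`.

```python
try:  # Python 3
    from urllib.request import Request, urlopen
    from urllib.parse import urlencode
except ImportError:  # Python 2.7, as used in the question
    from urllib2 import Request, urlopen
    from urllib import urlencode

# Hypothetical browser-like value; any string the server accepts would do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"


def build_patent_request(inventor, user_agent=BROWSER_UA):
    """Build (but do not send) a search request with a custom User-Agent."""
    params = [
        ("tbo", "p"),
        ("q", 'ininventor:"{}"'.format(inventor)),
        ("hl", "en"),
        ("tbm", "pts"),
        ("source", "lnt"),
        ("tbs", "ptso:us"),
    ]
    # urlencode percent-escapes the quotes and colons in the query values.
    url = "http://www.google.com/search?" + urlencode(params)
    return Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=10):
    """Send the request and return the raw response body."""
    response = urlopen(request, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


req = build_patent_request("John-Mudd")
# Request stores header names capitalized, hence 'User-agent' here.
print(req.get_header("User-agent"))
```

Calling `fetch(req)` would actually send the request. Whether the server accepts a given user agent string is not guaranteed and may change over time, so a 403 can still come back.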