Working on a little script to fetch info from websites. I'm having trouble with HTTP errors.
req = urllib.request.Request(lnk['href'],
headers={'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
page = urllib.request.urlopen(req)
When this triest to fetch, for example, http://www.guru99.com/node-js-tutorial.html
I get a long series of errors, ending with 406 Unacceptable:
Traceback (most recent call last):
File "get_links.py", line 45, in <module>
page = urllib.request.urlopen(req)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 471, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 581, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 509, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 443, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 589, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable
Googling around I have found that I should fix the headers (as I have done above) and lots of tutorials about how to fix the headers. Except - not much actually works.
Is there some set of good headers which are likely to not cause a problem with most sites? Is there some python module someone else has created that already includes commonly-working headers? Is there a good way to retry several times with different headers until you get a good response?
This seems like a problem everybody who does web scraping with Python deals with, and I haven't found a decent solution.
HTTP Error 406 Not acceptable
The HyperText Transfer Protocol (HTTP) 406 Not Acceptable client error response code indicates that the server cannot produce a response matching the list of acceptable values defined in the request's proactive content negotiation headers, and that the server is unwilling to supply a default representation.
So I can see the issue is with your both User-Agent: Mozilla/5.0
key and value. Here are the links of the bunch of correct User Agents,
So change your code to the following,
headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
I know the answer is too late but hope this helps someone else.