Tags: python, web-scraping, http-headers, urllib

python urllib.request - headers that are likely to work


Working on a little script to fetch info from websites. I'm having trouble with HTTP errors.

import urllib.request

# lnk is a link element parsed earlier in the script; lnk['href'] is its URL
req = urllib.request.Request(lnk['href'],
   headers={'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
page = urllib.request.urlopen(req)

When this tries to fetch, for example, http://www.guru99.com/node-js-tutorial.html, I get a long traceback ending with 406 Not Acceptable:

Traceback (most recent call last):
  File "get_links.py", line 45, in <module>
    page = urllib.request.urlopen(req)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 471, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 581, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 509, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 443, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 589, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable

Googling around, I have found that I should fix the headers (as I have done above), along with lots of tutorials about how to fix them. Except that not much actually works.

Is there some set of good headers that is unlikely to cause problems with most sites? Is there a Python module someone else has created that already includes commonly working headers? Is there a good way to retry several times with different headers until you get a good response?

This seems like a problem everybody who does web scraping with Python deals with, and I haven't found a decent solution.


Solution

  • HTTP Error 406 Not Acceptable

    The HyperText Transfer Protocol (HTTP) 406 Not Acceptable client error response code indicates that the server cannot produce a response matching the list of acceptable values defined in the request's proactive content negotiation headers, and that the server is unwilling to supply a default representation.
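To see what the server actually objected to, it can help to catch the exception and look at the status code and response headers instead of reading the whole traceback. This is a minimal sketch: `describe_http_error` and `try_fetch` are hypothetical helper names, while `code`, `reason`, and `headers` are attributes of `urllib.error.HTTPError`.

```python
import urllib.error
import urllib.request


def describe_http_error(err):
    """Return a one-line summary of an HTTPError (hypothetical helper)."""
    # err.headers holds the response headers; a Server header sometimes hints
    # at which component rejected the request.
    server = err.headers.get("Server", "unknown") if err.headers else "unknown"
    return "{} {} (server: {})".format(err.code, err.reason, server)


def try_fetch(url, headers):
    """Return (body, None) on success or (None, summary) on an HTTP error."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=10) as page:
            return page.read(), None
    except urllib.error.HTTPError as err:
        return None, describe_http_error(err)
```

Calling `try_fetch` with the headers from the question should then give a short summary string for the failing URL rather than an uncaught exception.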

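    The status code nominally concerns the Accept headers, but here the likely culprit is the User-Agent value: a bare Mozilla/5.0 is not a complete browser User-Agent string, and some servers reject requests that send one. Lists of real browser User-Agent strings are easy to find online.

    So change your code to send a full User-Agent string, for example: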

    req = urllib.request.Request(lnk['href'],
       headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
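As a rough sketch of how this could be packaged for reuse, the headers can live in one dictionary and be applied to every request. `BROWSER_HEADERS` and `build_request` are hypothetical names, and the Accept-Language value is an added assumption, not something the site is known to require.

```python
import urllib.request

# Browser-like defaults. The User-Agent is the Chrome string from the answer;
# Accept-Language is an illustrative addition (an assumption).
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, extra_headers=None):
    """Return a Request for url carrying the default browser-like headers."""
    headers = dict(BROWSER_HEADERS)  # copy so the defaults stay unchanged
    if extra_headers:
        headers.update(extra_headers)  # per-request overrides win
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen` exactly as `req` is in the question. No header set is guaranteed to work everywhere, so treat these values as a starting point.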
    

    I know this answer is late, but I hope it helps someone else.
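On the question of retrying with different headers, one possible approach is to loop over a few candidate header sets and move on only when the server answers with a status that suggests it disliked the request headers. This is a sketch under assumptions: `HEADER_SETS`, `RETRY_CODES`, and `fetch_with_fallback` are hypothetical names, and treating 403 and 406 as retryable is a guess that may not suit every site.

```python
import urllib.error
import urllib.request

# Candidate header sets, tried in order (illustrative values only).
HEADER_SETS = [
    {"User-Agent": "Mozilla/5.0"},
    {
        "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    },
]

# Status codes treated as "the server disliked these headers" (an assumption).
RETRY_CODES = {403, 406}


def fetch_with_fallback(url, header_sets=HEADER_SETS,
                        opener=urllib.request.urlopen):
    """Try each header set in turn and return the first successful response."""
    last_error = None
    for headers in header_sets:
        req = urllib.request.Request(url, headers=headers)
        try:
            return opener(req)
        except urllib.error.HTTPError as err:
            if err.code not in RETRY_CODES:
                raise  # e.g. 404 or 500: different headers will not help
            last_error = err
    if last_error is None:
        raise ValueError("header_sets must not be empty")
    raise last_error  # every header set was rejected
```

The `opener` parameter exists only so the function can be exercised without network access; in normal use the default `urllib.request.urlopen` is what runs.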