Tags: python, python-3.x, urllib, urlopen

urlopen only working for certain URLs in Python3


I'm trying to fetch the HTML of a page in Python 3.

If I do the following,

from urllib.request import urlopen
html = urlopen("http://google.com/")
html.read()

I get the HTML as desired. However, if I choose a different URL, as in the following,

from urllib.request import urlopen
html = urlopen("http://www.stackoverflow.com/")
html.read() 

I get the following error after the second line:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 461, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 574, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 499, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 433, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 582, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Any ideas why this would be happening and how to fix it?


Solution

  • If you look closely at the error message, you'll see that it is an HTTP error, and a specific one:

    HTTP Error 403: Forbidden
    

    So the request reached the server and a response came back, but the status code alone doesn't tell you why you were denied.

    The HTML body of the error response often contains a more detailed message. You can print it with something like this:

    from urllib.request import urlopen
    from urllib.error import HTTPError
    
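    try:
        html = urlopen("http://www.stackoverflow.com/")
    except HTTPError as e:
        # The error response has a body too; print the server's explanation.
        print(e.read().decode('utf-8'))
    else:
        # Read the page only if the request succeeded; otherwise html is undefined.
        html.read()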
    

    For me it says:

    <h2 data-translate="what_happened">What happened?</h2>
    <p>The owner of this website (www.stackoverflow.com) has banned your access based on your browser's signature (213702c58d2116a6-ua48).</p>
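
The "browser's signature" in that message refers to the User-Agent header. By default urllib identifies itself as Python-urllib/x.y, which some sites reject. A minimal sketch of one common workaround follows, assuming the server accepts a browser-like User-Agent (it may not). The helper names build_request and fetch and the User-Agent string itself are made up for illustration and are not part of the original answer.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent string, chosen only for illustration.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-script/0.1)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=DEFAULT_USER_AGENT):
    """Open the URL with the custom header and return the raw body as bytes."""
    with urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

Calling fetch("http://www.stackoverflow.com/") sends the request with the custom header instead of the default one. Whether the 403 goes away depends on the site's own rules.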
    

    Reading the body works because an HTTPError can be treated as a file-like object (https://docs.python.org/3/library/urllib.error.html#urllib.error.HTTPError). The documentation says:

    Though being an exception (a subclass of URLError), an HTTPError can also function as a non-exceptional file-like return value (the same thing that urlopen() returns). This is useful when handling exotic HTTP errors, such as requests for authentication.
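
To illustrate the file-like behaviour described in that quote, here is a small sketch that pulls the status code, reason, a header, and the body out of an HTTPError. The function names summarize_http_error and make_fake_403 are hypothetical. The error object is built by hand, with an invented URL and body, so the example runs without a network connection.

```python
import io
from email.message import Message
from urllib.error import HTTPError


def summarize_http_error(err, max_chars=200):
    """Collect the useful parts of an HTTPError into a plain dict."""
    # HTTPError is file-like, so the response body can be read directly.
    body = err.read().decode("utf-8", errors="replace")
    return {
        "status": err.code,
        "reason": err.reason,
        "content_type": err.headers.get("Content-Type"),
        "body_preview": body[:max_chars],
    }


def make_fake_403():
    """Build an HTTPError by hand; the URL and body are invented."""
    headers = Message()
    headers["Content-Type"] = "text/html"
    body = io.BytesIO(b"<p>Access denied</p>")
    return HTTPError("http://example.invalid/", 403, "Forbidden", headers, body)
```

In real code the same function can be called from an except HTTPError as e: block, passing e in place of the hand-built error.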