Why does urllib2's .getcode() method crash on 404's?


In the beginner Python course I took on Lynda, it said to use .getcode() to get the HTTP status code for a URL, and that this can be used as a test before reading the data:

import urllib2

webUrl = urllib2.urlopen('http://www.wired.com/tag/magazine-23-05/page/4')
print(str(webUrl.getcode()))
if webUrl.getcode() == 200:
    data = webUrl.read()
else:
    print 'error'

However, when used with the 404 page above, it causes Python to quit with: Python function terminated unexpectedly: HTTP Error 404: Not Found. So it seems this lesson was completely wrong?

My question then is: what exactly is .getcode() good for? You can't use it to test what the HTTP status code is unless you already know what it is (or at least that it's not a 404). Was the course wrong or am I missing something?

My understanding is the proper way to do it is like this, which doesn't use .getcode() at all (though tell me if there is a better way):

try:
    url = urllib2.urlopen('http://www.wired.com/tag/magazine-23-05/page/4')
except urllib2.HTTPError, e:
    print e

Am I misunderstanding the point of .getcode() or is it pretty much useless? It seems strange to me that a method for getting a status code, in a library dedicated to opening URLs, can't handle something as trivial as returning a 404.


Solution

  • A 404 code is considered an error status by urllib2 and thus an exception is raised. The exception object also supports the getcode() method:

    >>> import urllib2
    >>> try:
    ...     url = urllib2.urlopen('http://www.wired.com/tag/magazine-23-05/page/4')
    ... except urllib2.HTTPError, e:
    ...     print e
    ...     print e.getcode()
    ...
    HTTP Error 404: Not Found
    404
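
To tie this back to the question's "check the code before reading" idea, here is a minimal sketch of a helper that returns the status code whether or not urlopen raises. The name get_status and its opener parameter are hypothetical (they are not part of urllib2), and the fallback import is an assumption so the same code also runs on Python 3, where these names live in urllib.request and urllib.error.

```python
try:
    # Python 2, as used in this thread
    from urllib2 import urlopen, HTTPError
except ImportError:
    # Assumption: on Python 3 the same names live in these modules
    from urllib.request import urlopen
    from urllib.error import HTTPError


def get_status(url, opener=urlopen):
    """Return the HTTP status code for url, even when the opener raises.

    `opener` is a hypothetical hook: any callable that behaves like urlopen.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # e.code holds the same value that e.getcode() returns
        return e.code
    try:
        return response.getcode()
    finally:
        response.close()
```

Calling get_status('http://www.wired.com/tag/magazine-23-05/page/4') should then give back the 404 as a plain integer instead of terminating the script, assuming the page still responds that way.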
    

    The fact that errors are raised is poorly documented. The library uses a stack of handlers to form a URL opener (created with urllib2.build_opener(), installed with urllib2.install_opener()), and the default stack includes the urllib2.HTTPErrorProcessor class.

    It is that class that causes any response with a status code outside the 2xx range to be handled as an error. The 3xx status codes are then handled by the HTTPRedirectHandler object, and some of the 40x codes (related to authentication) are handled by specialised authentication handlers, but most codes are simply left to be raised as an exception.
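
If you would rather have the library hand back the 404 response instead of raising, one option is to swap out the error processor in that handler stack. This is only a sketch: the class name NoRaiseErrorProcessor is hypothetical, and the fallback import is an assumption for Python 3. Be aware that redirects and the authentication handlers are dispatched through the same error path, so an opener built this way will not follow 3xx redirects either.

```python
try:
    # Python 2, as used in this thread
    from urllib2 import build_opener, HTTPErrorProcessor
except ImportError:
    # Assumption: on Python 3 the same names live in urllib.request
    from urllib.request import build_opener, HTTPErrorProcessor


class NoRaiseErrorProcessor(HTTPErrorProcessor):
    """Hypothetical handler: pass every response through untouched."""

    def http_response(self, request, response):
        # The stock implementation calls self.parent.error() for non-2xx
        # codes, which is what ends in HTTPError; returning the response
        # as-is skips that step.
        return response

    https_response = http_response


# build_opener() leaves out a default handler when it is given a subclass
# of it, so this opener contains no stock HTTPErrorProcessor.
opener = build_opener(NoRaiseErrorProcessor)
```

With this opener, opener.open(url).getcode() should report the status of an error page rather than raising, which is the behaviour the course example seems to have assumed.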

    If you are open to installing additional Python libraries, I recommend the requests library instead, where error handling is a lot saner. No exceptions are raised for error status codes unless you explicitly request it:

    import requests

    url = 'http://www.wired.com/tag/magazine-23-05/page/4'
    response = requests.get(url)
    print(response.status_code)  # the status code is available even for a 404
    response.raise_for_status()  # raises an exception for 4xx or 5xx status codes.