
Why can't I load this page using Python?


If I use urllib to load this URL (https://www.fundingcircle.com/my-account/sell-my-loans/), I get a 400 status error.

For example, the following returns a 400 error:

>>> import urllib
>>> f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
>>> print f.read()

However, if I copy and paste the url into my browser, I see a web page with the information that I want to see.

I have tried wrapping the call in try/except and reading the error, but the returned data just tells me that the page does not exist, e.g.:

import urllib
try:
    f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
except Exception as e:
    eString = e.read()
    print eString

Why can't Python load the page?


Solution

  • If Python is given a 400 status, that's because the server refuses to give you the page.

    Why that is can be difficult to know, because servers are black boxes. But your browser gives the server more than just the URL: it also sends a set of HTTP headers. Most likely the server alters its behaviour based on the contents of one or more of those headers.

    You need to look in your browser's development tools to see what your browser sends, then try to replicate some of those headers from Python. Obvious candidates are the User-Agent header, followed by the Accept and Cookie headers.
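    For instance, with Python 3's urllib.request (the urllib2 module on Python 2), you can attach headers to a Request before opening it; the header values below are illustrative stand-ins for whatever your browser actually sends:

    ```python
    from urllib.request import Request

    # Build a request carrying browser-like headers; the values here are
    # examples only -- copy the real ones from your browser's dev tools.
    req = Request(
        "https://www.fundingcircle.com/my-account/sell-my-loans/",
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Accept": "text/html,application/xhtml+xml",
        },
    )
    # urlopen(req) would then send these headers along with the GET request.
    ```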

    However, in this specific case, the server responds with a 401 Unauthorized and gives you a login page instead. It does this for both the browser and Python:

    >>> import urllib
    >>> urllib.urlopen('https://www.fundingcircle.com/my-account/sell-my-loans/')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 87, in urlopen
        return opener.open(url)
      File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 208, in open
        return getattr(self, name)(url)
      File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 451, in open_https
        return self.http_error(url, fp, errcode, errmsg, headers)
      File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 372, in http_error
        result = method(url, fp, errcode, errmsg, headers)
      File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 683, in http_error_401
        errcode, errmsg, headers)
      File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 381, in http_error_default
        raise IOError, ('http error', errcode, errmsg, headers)
    IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x1066f9a28>)
    

    but Python's urllib doesn't have a handler for the 401 status code and turns that into an exception.
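    In Python 3 the exception itself is the response: urllib.request.urlopen raises an HTTPError that you can still .read() to get the error body (here, the login page). A minimal sketch:

    ```python
    from urllib.error import HTTPError
    from urllib.request import urlopen

    def fetch(url):
        """Return the response body, even when the server answers 4xx/5xx."""
        try:
            return urlopen(url).read()
        except HTTPError as e:
            # HTTPError doubles as a file-like response: e.code is the
            # status (e.g. 401) and e.read() is the body the server sent.
            print("HTTP error", e.code)
            return e.read()
    ```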

    The response body contains a login form; you'll have to write code to log in here, and presumably track cookies.

    That task would be a lot easier with more specialised tools. You could use robobrowser to load the page, parse the form and give you the tools to fill it out, then post the form for you and track the cookies required to keep you logged in. It is built on top of the excellent requests and BeautifulSoup libraries.
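    If you'd rather stay in the standard library, the same log-in-and-keep-cookies flow can be sketched with http.cookiejar; the form field names and login URL below are hypothetical, so inspect the actual login form for the real ones:

    ```python
    from http.cookiejar import CookieJar
    from urllib.parse import urlencode
    from urllib.request import HTTPCookieProcessor, build_opener

    # An opener that stores cookies from responses and replays them on
    # every subsequent request it makes.
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))

    # Hypothetical field names -- read them off the real login form.
    login_data = urlencode({
        "email": "you@example.com",
        "password": "secret",
    }).encode("ascii")

    # opener.open("https://www.fundingcircle.com/login", login_data) would
    # POST the form; the cookie jar then keeps you logged in for later
    # opener.open(...) calls, such as the sell-my-loans page.
    ```

    Libraries like requests wrap this same pattern in a Session object, which is what robobrowser builds on.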