Search code examples
pythondownloadbloburllib

Download file from Blob URL with Python


I wish to have my Python script download the Master data (Download, XLSX) Excel file from this Frankfurt stock exchange webpage.

When to retrieve it with urrlib and wget, it turns out that the URL leads to a Blob and the file downloaded is only 289 bytes and unreadable.

http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx

I'm entirely unfamiliar with Blobs and have these questions:

  • Can the file "behind the Blob" be successfully retrieved using Python?

  • If so, is it necessary to uncover the "true" URL behind the Blob – if there is such a thing – and how? My concern here is that the link above won't be static but actually change often.


Solution

  • That 289 byte long thing might be a HTML code for 403 forbidden page. This happen because the server is smart and rejects if your code does not specify a user agent.

    Python 3

    # python3
    import urllib.request as request
    
    url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
    # fake user agent of Safari
    fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
    r = request.Request(url, headers={'User-Agent': fake_useragent})
    f = request.urlopen(r)
    
    # print or write
    print(f.read())
    

    Python 2

    # python2
    import urllib2
    
    url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
    # fake user agent of safari
    fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
    
    r = urllib2.Request(url, headers={'User-Agent': fake_useragent})
    f = urllib2.urlopen(r)
    
    print(f.read())