Why can't I access Digikey's website through urllib?


I'm following the guide here:

Python3 Urllib Tutorial

Everything works fine for those first few examples:

import urllib.request

html = urllib.request.urlopen('https://arstechnica.com').read()
print(html)

and

import urllib.request

headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"

req = urllib.request.Request('https://arstechnica.com', headers = headers)
html = urllib.request.urlopen(req).read()
print(html)

But if I replace "arstechnica" with "digikey", the urllib request always times out, even though the website is easily accessible through a browser. What's going on?


Solution

  • Most websites will try to defend themselves against unwanted bots. If they detect suspicious traffic, they may decide to stop responding without properly closing the connection (leaving you hanging). Some sites are more sophisticated at detecting bots than others.

    Firefox 48.0 was released back in 2016, so it will be pretty obvious to Digikey that you are probably spoofing the header information. Browsers also typically send additional headers that your script doesn't.

    In Firefox, if you open the Developer Tools and go to the Network Monitor tab, you can inspect a request to see what headers it sends, then copy these to better mimic the behaviour of a typical browser.

    import urllib.request
    
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Upgrade-Insecure-Requests": "1"
    }
    
    req = urllib.request.Request('https://www.digikey.com', headers = headers)
    html = urllib.request.urlopen(req).read()
    print(html)
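
Since a site may simply stop responding and leave the connection hanging, it can also help to pass a timeout to urlopen so the script fails quickly instead of waiting indefinitely. The following is a minimal sketch, not a definitive fix: the names `BROWSER_HEADERS`, `build_request` and `fetch`, the 10-second default, and the `opener` parameter (there only so the function can be tried without network access) are assumptions made for illustration. There is no guarantee that Digikey will accept these headers.

```python
import socket
import urllib.error
import urllib.request

# Headers copied from the answer above; a real browser may send more.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Upgrade-Insecure-Requests": "1",
}


def build_request(url, extra_headers=None):
    """Return a Request carrying the browser-like headers (hypothetical helper)."""
    headers = dict(BROWSER_HEADERS)
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails."""
    request = build_request(url)
    try:
        # The timeout applies to blocking socket operations such as connecting.
        with opener(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        print(f"Server responded with HTTP {err.code}")
    except urllib.error.URLError as err:
        # DNS failure, refused connection, or a timeout while connecting.
        print(f"Could not reach server: {err.reason}")
    except socket.timeout:
        # The connection was opened but the server stopped sending data.
        print(f"No response within {timeout} seconds")
    return None


# Example call (performs a real network request, so it is left commented out):
# html = fetch("https://www.digikey.com")
```

If `fetch` returns None, the printed message shows whether the server refused the request outright or never answered, which helps tell bot blocking apart from an ordinary network problem.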