I'm following the guide here:
Everything works fine for the first few examples:
import urllib.request
html = urllib.request.urlopen('https://arstechnica.com').read()
print(html)
and
import urllib.request
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"
req = urllib.request.Request('https://arstechnica.com', headers = headers)
html = urllib.request.urlopen(req).read()
print(html)
But if I replace "arstechnica" with "digikey", the urllib request always times out, even though the website is easily accessible through a browser. What's going on?
Most websites will try to defend themselves against unwanted bots. If they detect suspicious traffic, they may simply stop responding without properly closing the connection (leaving you hanging). Some sites are more sophisticated at detecting bots than others.
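That silent drop is why the request appears to hang. You can at least fail fast by passing a timeout to urlopen; a minimal sketch (the 10-second value is arbitrary):

import socket
import urllib.request
import urllib.error

req = urllib.request.Request('https://www.digikey.com')
try:
    # Give up after 10 seconds instead of waiting indefinitely
    html = urllib.request.urlopen(req, timeout=10).read()
except (urllib.error.URLError, socket.timeout) as e:
    print('Request failed:', e)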
Firefox 48.0 was released back in 2016, so it will be pretty obvious to Digikey that you are probably spoofing the header information. There are also additional headers that browsers typically send which your script doesn't include.
In Firefox, if you open the Developer Tools and go to the Network Monitor tab, you can inspect a request to see what headers it sends, then copy these to better mimic the behaviour of a typical browser.
import urllib.request
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Upgrade-Insecure-Requests": "1"
}
req = urllib.request.Request('https://www.digikey.com', headers=headers)
html = urllib.request.urlopen(req).read()
print(html)
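Note that read() returns bytes, so print(html) shows a b'...' literal rather than readable HTML. Assuming the page is UTF-8 encoded, you can decode it first:

print(html.decode('utf-8'))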