I have encountered a URL that cannot be opened with urllib.request.urlopen
from the standard library in Python 3.8: the request fails with HTTP Error 404. While experimenting with Selenium, I discovered by chance that selenium.webdriver.Chrome
can open the same URL. I would like to understand why.
Here is a minimal example:
    from urllib.request import urlopen, HTTPError
    from selenium import webdriver

    urls = ("https://yahoo.com",
            "https://finance.yahoo.com/quote/IWM?p=IWM",
            "https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800")

    for url in urls:
        print(f"\nopening {url}:")
        try:
            with urlopen(url) as f:
                lines = f.readlines()
                n = len(lines)
                print(f"retrieved {n} lines.")
        except HTTPError as e:
            print(e)
            print(f"\nretrying {url} with Selenium webdriver:")
            options = webdriver.chrome.options.Options()
            options.add_argument("--headless")
            driver = webdriver.Chrome(options=options)
            driver.get(url)
            lines = driver.page_source.split("\n")
            n = len(lines)
            print(f"retrieved {n} lines.")
            driver.close()
Here is its output:
opening https://yahoo.com:
retrieved 1805 lines.
opening https://finance.yahoo.com/quote/IWM?p=IWM:
retrieved 655 lines.
opening https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800:
HTTP Error 404: Not Found
retrying https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800 with Selenium webdriver:
retrieved 572 lines.
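Some sites restrict access based on the User-Agent request header. By default, urllib identifies itself as Python-urllib/3.x, which a site may reject, whereas Selenium drives a real browser that sends a browser User-Agent. Try supplying a browser-like User-Agent with your request: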
    from urllib.request import urlopen, HTTPError, Request

    req = Request(
        "https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800",
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )

    try:
        with urlopen(req) as f:
            lines = f.readlines()
            n = len(lines)
            print(f"retrieved {n} lines.")
    except HTTPError as e:
        print(e)
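
To see what urllib sends when no User-Agent is supplied, and to confirm that an explicit header is attached to a request, something like the following sketch may help. It makes no network calls. The names BROWSER_UA, default_user_agent, and make_request are hypothetical helpers introduced only for illustration, and the browser string is an arbitrary example value, not one known to be required by the site.

```python
from urllib.request import Request, build_opener

# Assumed example value; any browser-like string could be substituted.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/96.0 Safari/537.36")


def default_user_agent():
    """Return the User-Agent urllib adds when the request does not set one."""
    opener = build_opener()
    # OpenerDirector.addheaders is a list of (name, value) pairs.
    return dict(opener.addheaders)["User-agent"]


def make_request(url, user_agent=BROWSER_UA):
    """Build a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = make_request("https://finance.yahoo.com/quote/IWM/options?p=IWM")
print(default_user_agent())          # a string starting with "Python-urllib/"
print(req.get_header("User-agent"))  # Request stores the name as "User-agent"
```

Passing the resulting req to urlopen sends the explicit header instead of the default. Whether that is enough to avoid the 404 depends on how the site filters requests, so treat it as something to try rather than a guaranteed fix.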