Tags: python, selenium, urllib

Why can a Selenium webdriver open a URL that the standard Python urlopen function cannot?


I have encountered a URL that cannot be opened with urllib.request.urlopen from the standard library in Python 3.8. By sheer luck, I happened to be experimenting with Selenium and discovered that selenium.webdriver.Chrome can open this same URL. I would like to understand why this is the case.

Here is a minimal example:

from urllib.error import HTTPError  # HTTPError is defined in urllib.error
from urllib.request import urlopen
from selenium import webdriver

urls = ("https://yahoo.com",
        "https://finance.yahoo.com/quote/IWM?p=IWM",
        "https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800")

for url in urls:
    print(f"\nopening {url}:")
    try:
        with urlopen(url) as f:
            lines = f.readlines()
        n = len(lines)
        print(f"retrieved {n} lines.")
    except HTTPError as e:
        print(e)

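# Retry the last URL (the one that failed above) in headless Chrome.
print(f"\nretrying {url} with Selenium webdriver:")
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get(url)
    lines = driver.page_source.split("\n")
    n = len(lines)
    print(f"retrieved {n} lines.")
finally:
    # quit() also shuts down the chromedriver process; close() only closes the window.
    driver.quit()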

Here is its output:

opening https://yahoo.com:
retrieved 1805 lines.

opening https://finance.yahoo.com/quote/IWM?p=IWM:
retrieved 655 lines.

opening https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800:
HTTP Error 404: Not Found

retrying https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800 with Selenium webdriver:
retrieved 572 lines.

Solution

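  • Some sites restrict access based on the User-Agent header. By default, urllib identifies itself as Python-urllib/3.8 (on Python 3.8), which a server may reject, whereas Selenium drives a real Chrome browser that sends a browser User-Agent. That would explain why urlopen gets a 404 for a page Chrome can load. Try sending a browser-like User-Agent with your request: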

    from urllib.error import HTTPError
    from urllib.request import Request, urlopen
    
    # Wrap the URL in a Request so that custom headers can be attached.
    req = Request(
        "https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800",
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    
    try:
        with urlopen(req) as f:
            lines = f.readlines()
        n = len(lines)
        print(f"retrieved {n} lines.")
    except HTTPError as e:
        print(e)
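
To see what urllib sends by default, and to confirm that a header set on the Request takes precedence, you can inspect the objects without making any network call. This is only a minimal sketch: the https://example.com/ URL and the BROWSER_UA string are placeholders chosen for illustration, not values from the question.

```python
import sys
from urllib.request import Request, build_opener

# Hypothetical browser-like User-Agent string, used only for illustration.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0 Safari/537.36"
)

# An opener carries the default headers that are added to any request
# which does not already set them.
opener = build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)  # 'User-agent' is 'Python-urllib/X.Y' for Python X.Y

# A header passed to Request is stored on the request itself, so the
# opener's default User-agent is not added on top of it.
# Note that urllib stores the header name capitalized as "User-agent".
req = Request("https://example.com/", headers={"User-Agent": BROWSER_UA})
print(req.get_header("User-agent"))

# Alternatively, replace the opener's defaults so that every request made
# through opener.open(...) sends the browser-like value.
opener.addheaders = [("User-Agent", BROWSER_UA)]
```

If you would rather keep calling urlopen directly, passing this opener to urllib.request.install_opener makes urlopen use it as well. A browser-like User-Agent is not guaranteed to work on every site, since a server may also check cookies or require JavaScript, in which case Selenium is still needed.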