python-3.x urllib3

How to get a webpage with the urllib library?


The webpage opens fine in my browser:

https://www.sec.gov/files/company_tickers_exchange.json

I added a browser User-Agent header when requesting the page with urllib:

from urllib.request import Request, urlopen
url = "https://www.sec.gov/files/company_tickers_exchange.json"
req = Request(
    url=url, 
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

It runs into an error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

However, I can get the webpage with Playwright:

from playwright.sync_api import sync_playwright as playwright
pw = playwright().start()
browser = pw.chromium.launch(headless=False)
context = browser.new_context()
page = context.new_page()
url = "https://www.sec.gov/files/company_tickers_exchange.json"
page.goto(url)
page.content()

I feel this is a clunky method. How can I get the webpage using only urllib?


Solution

  • Judging by the Fair Access section of SEC.gov | Accessing EDGAR Data, sending a generic browser User-Agent from a non-browser client (as you've tried to do) will likely be rejected. The page asks callers to identify themselves instead:

    Please declare your user agent in request headers:

    Sample Declared Bot Request Headers:

    [Header] [Value]
    User-Agent: Sample Company Name AdminContact@<sample company domain>.com
    Accept-Encoding: gzip, deflate
    Host: www.sec.gov

    Heeding this advice seems to work in my test on Repl.it:

    from urllib.request import Request, urlopen
    url = "https://www.sec.gov/files/company_tickers_exchange.json"
    req = Request(
        url=url, 
        headers={'User-Agent': 'Sean Quinn email@redacted.com'}
    )
    webpage = urlopen(req).read()
    print(webpage)
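
If you also want to send the Accept-Encoding header from the sample and get parsed JSON back, something along these lines might work. It is only a sketch: the helper names (build_request, decode_body, fetch_json) and the contact string are made up for illustration, and I only ask for gzip (not deflate) because urllib does not decompress responses for you, so the code has to do it itself.

```python
import gzip
import json
from urllib.request import Request, urlopen

URL = "https://www.sec.gov/files/company_tickers_exchange.json"


def build_request(url, contact):
    """Build a request that declares who is calling (hypothetical helper)."""
    return Request(
        url=url,
        headers={
            # Assumed format: "<name> <contact email>", as in the SEC sample.
            "User-Agent": contact,
            # Only gzip is requested so decode_body has a single case to undo.
            "Accept-Encoding": "gzip",
        },
    )


def decode_body(raw, content_encoding):
    """Decompress the body if the server gzipped it, then parse it as JSON."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def fetch_json(url, contact):
    """Send the declared request and return the parsed JSON payload."""
    with urlopen(build_request(url, contact), timeout=30) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Calling fetch_json(URL, "Your Name you@example.com") with your own contact details should then give you a Python object instead of raw bytes. Whether the server keeps accepting the request still depends on staying within the Fair Access limits described on that page.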