How can I fetch this webpage with the urllib library?

The webpage opens fine in my browser:

https://www.sec.gov/files/company_tickers_exchange.json

I added a browser user agent when requesting the webpage with urllib:

from urllib.request import Request, urlopen
url = "https://www.sec.gov/files/company_tickers_exchange.json"
req = Request(
    url=url, 
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

It fails with this error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I can get the webpage with Playwright, though:

from playwright.sync_api import sync_playwright as playwright
pw = playwright().start()
browser = pw.chromium.launch(headless=False)
context = browser.new_context()
page = context.new_page()
url = "https://www.sec.gov/files/company_tickers_exchange.json"
page.goto(url)
page.content()

That feels like a clunky method. How can I get the webpage using only urllib?

Solution

Judging by the Fair Access section of SEC.gov | Accessing EDGAR Data, sending a browser-style User-Agent header from a non-browser client (as you've tried to do) is likely to be rejected:

Please declare your user agent in request headers:

Sample Declared Bot Request Headers:

[Header] [Value]

User-Agent: Sample Company Name AdminContact@.com

Accept-Encoding: gzip, deflate

Host: www.sec.gov

Heeding this advice seems to work in my test on Repl.it:

from urllib.request import Request, urlopen
url = "https://www.sec.gov/files/company_tickers_exchange.json"
req = Request(
    url=url, 
    headers={'User-Agent': 'Sean Quinn email@redacted.com'}
)
webpage = urlopen(req).read()
print(webpage)
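
If you also want to send the Accept-Encoding header from the sample above and end up with parsed JSON rather than raw bytes, something along these lines might work. This is only a sketch: the helper names (build_request, decode_body, fetch_json) are my own, the contact string is a placeholder you would replace with your own details, and I am assuming the server may or may not gzip the response, so the code checks Content-Encoding before decompressing.

```python
import gzip
import json
from urllib.request import Request, urlopen

URL = "https://www.sec.gov/files/company_tickers_exchange.json"


def build_request(url, user_agent):
    """Build a request that declares who is calling (hypothetical helper)."""
    return Request(
        url=url,
        headers={
            "User-Agent": user_agent,
            # urllib does not decompress responses itself, so only
            # advertise gzip, which decode_body() below can handle.
            "Accept-Encoding": "gzip",
        },
    )


def decode_body(raw, content_encoding):
    """Decompress the body if it was gzipped, then parse it as JSON."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def fetch_json(url, user_agent):
    """Fetch the URL and return the parsed JSON document."""
    req = build_request(url, user_agent)
    with urlopen(req) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))


# Usage (performs a real network request; the contact string is a placeholder):
# data = fetch_json(URL, "Sample Company Name admin@example.com")
# print(type(data))
```

Keeping the request building and the decoding in separate functions means both can be checked without touching the network.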