I'm trying to automate scraping of SEC / EDGAR financial reports, but I'm getting HTTP Error 403: Forbidden. I've looked at similar Stack Overflow posts and changed my code accordingly, but no luck so far.
test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'
Code that I'm working with:
import urllib.request

def get_data(link):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    req = urllib.request.Request(link, headers=hdr)
    page = urllib.request.urlopen(req, timeout=10)
    content = page.read().decode('utf-8')
    return content

data = get_data(test_URL)
The error I'm getting:
HTTPError                                 Traceback (most recent call last)
~\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden
I've also tried requests.get(test_URL) and then parsing with BeautifulSoup, but that doesn't return the whole text. Is there another approach I could follow?
I had no problems using the requests package. I did need to add a user-agent header; without it, I was getting the same 403 as you. Try this:
import requests

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

def get_data(link):
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
    req = requests.get(link, headers=hdr)
    content = req.content
    return content

data = get_data(test_URL)
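One note for robustness: SEC's automated-access guidance asks scrapers to identify themselves with a declared User-Agent of the form "Company Name contact@email", and browser-style user agents can still get blocked as traffic policies change. A minimal sketch of that convention (the company name and email below are placeholders you must replace with your own):

```python
import requests

# Declared user agent per SEC's fair-access guidance; the name and
# address here are placeholders -- substitute your own details.
hdr = {'User-Agent': 'Sample Company Name admin@example.com'}

def get_data(link, headers=hdr):
    resp = requests.get(link, headers=headers, timeout=10)
    resp.raise_for_status()  # raise on 403/404 instead of returning an error page
    return resp.text         # decoded text; use resp.content for raw bytes

# Usage:
# data = get_data('https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt')
```

Using resp.text also saves you the manual .decode('utf-8') step, since requests decodes based on the response headers.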