Downloading files from EDGAR using Python 3.9

I am new to the world of coding, so please bear with me if I misuse terminology or generally do not know what I am talking about. I am doing a research project in which I am trying to scrape public company 10-Ks from EDGAR. I have read various sources and watched various videos, but I found the reference below to be the most relevant to my project, and quite frankly, it is easy for me to follow. The explanation for my code begins on page 194, and the code is on page 195. I am first attempting to download the index files (image below), which will help me write code to get 10-Ks specifically. So, I am in the early stages of my project.

This is just a reference of the paper I am using. It is currently on SSRN, so I realize everyone may not have access. I would upload the PDF, but I don't see that as an option. I listed this purely to show I have a source for what I am doing. I can provide screenshots if necessary.

Anand, V., Bochkay, K., Chychyla, R., & Leone, A. J. (2020). Using Python for text analysis in accounting research. Forthcoming, Foundations and Trends in Accounting.

Index file example (image not shown).

Currently, I have two issues: my code doesn’t work as intended, and I appear to be getting blocked by EDGAR. I will discuss the former first and the latter at the end. When I run the code below, it should download both the 2018 and 2019 index files to the down_direct path. However, it only grabs the 2018 index files.

The log/IDLE shell results below show a “successful” and an unsuccessful run. The unsuccessful run makes me think I have been blocked by EDGAR. It is my understanding that certain websites look for requests from urllib.request and may automatically screen them out. However, EDGAR is researcher friendly as long as you attempt downloads after hours and space out your attempts, both of which I have done (I worked on this from 7pm to 10pm last night and waited roughly 10 minutes between attempts). So, my questions are:

  1. How should I adjust my code to make it run as intended? (i.e., pull all 4 qtrs of the start_year and end_year)

  2. Am I being blocked by EDGAR? If so, can I tweak my code to get around that?

    import os
    import urllib.request
    from pathlib import Path
    def get_index(start_year:int, end_year:int, down_direct:str):
        start_year = 2018
        end_year = 2019
        down_direct = r"C:/Users/Documents/Student Files/~Current Student/~RESEARCH/~First Summer Paper/Data/EDGAR/"
        print('Retrieving data')
        if not os.path.exists(down_direct):
            os.makedirs(down_direct)
        for year in range(start_year, end_year+1):
            for qtr in range(1,5):
                url = r"" + str(year) + '/' + 'QTR' + str(qtr) + '/master.idx' 
                dl_file = down_direct + 'master' + str(year) + str(qtr) + '.idx'
                urllib.request.urlretrieve(url, dl_file)
            print('Downloaded', dl_file, end = '\n')
            print('Data retrieved')
    down_direct = os.path.join(Path.home(), 'edgar', 'indexfiles')
    get_index(2018, 2019, down_direct)

Successful Run

Retrieving Data

Downloaded C:/Users/Documents/Student Files/~Current Student/~RESEARCH/~First Summer Paper/Data/EDGAR/master20184.idx

Data retrieved

Unsuccessful Run (For sake of space, I only included the error line)

Retrieving Data

urllib.error.HTTPError: HTTP Error 403: Forbidden

I have seen similar posts where people recommend adding the line below to the code to get around this error, but I am so green that I don’t really know how to incorporate it. Any help is appreciated, and if I need to edit my post with more information, please let me know.

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
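One way the `Request` line above might be worked into a download is sketched here. `urllib.request.urlretrieve` has no `headers` parameter, so the sketch opens the request with `urllib.request.urlopen` and copies the response to disk itself. The names `USER_AGENT`, `build_request`, and `download` are made up for illustration, and the User-Agent value is a placeholder. My understanding is that EDGAR expects automated clients to identify themselves in that header, so a string with your own name and contact address is probably a better choice than a browser string, but treat that as an assumption to verify.

```python
import shutil
import urllib.request
from pathlib import Path

# Hypothetical identifying string; replace with your own name and contact address.
USER_AGENT = "Sample Researcher sample.researcher@example.com"


def build_request(url, user_agent=USER_AGENT):
    """Wrap the URL in a Request object that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dl_file, user_agent=USER_AGENT):
    """Fetch url with the custom header and save the response body to dl_file."""
    req = build_request(url, user_agent)
    # urlopen accepts a Request object, so the header is sent with the request.
    with urllib.request.urlopen(req) as response, open(dl_file, "wb") as out:
        shutil.copyfileobj(response, out)


# Inside the loop, this call would take the place of urlretrieve:
#     download(url, dl_file)
```

A 403 response will still surface as `urllib.error.HTTPError`, so the header changes what the server sees but not how errors are reported.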


    import requests

    # Request headers sent with every download (the Host value is missing here).
    heads = {'Host': '', 'Connection': 'close',
             'Accept': 'application/json, text/javascript, */*; q=0.01', 'X-Requested-With': 'XMLHttpRequest',
             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}

    def download(year):
        # Fetch the master index for each of the four quarters of the given year.
        for qtr in range(1, 5):
            # The base URL is missing here; it must be prepended to this path.
            url = f"{year}/QTR{qtr}/master.idx"
            response = requests.get(url, headers=heads)
            down_direct = r"C:/Users/Documents/Student Files/~Current Student/~RESEARCH/~First Summer Paper/Data/EDGAR/"
            with open(f'{down_direct}/master{year}QTR{qtr}.idx', 'wb') as f:
                f.write(response.content)  # save the raw bytes of the index file

    start_year = 2018
    end_year = 2019
    for i in range(start_year, end_year+1):
        download(i)
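For the first question (pulling all four quarters of every year from start_year to end_year), a minimal sketch of the loop structure follows. In the question's function the three arguments are reassigned on its first lines, so whatever is passed in is ignored, and the `print` calls sit outside the inner loop, so at most one file per year is reported. The sketch avoids both. `index_targets`, `base_url`, `fetch`, and `pause_seconds` are hypothetical names introduced for illustration: `base_url` stands for the index location that is not shown above, and `fetch` is any callable taking `(url, path)`, for example a thin wrapper around `requests.get`.

```python
import time
from pathlib import Path
from typing import Callable, Iterator, List, Tuple


def index_targets(start_year: int, end_year: int,
                  down_direct: str) -> Iterator[Tuple[int, int, Path]]:
    """Yield (year, quarter, local path) for every quarter in the range."""
    for year in range(start_year, end_year + 1):
        for qtr in range(1, 5):
            yield year, qtr, Path(down_direct) / f"master{year}QTR{qtr}.idx"


def get_index(start_year: int, end_year: int, down_direct: str,
              base_url: str, fetch: Callable[[str, Path], None],
              pause_seconds: float = 1.0) -> List[Path]:
    """Download one master.idx per quarter, using the arguments as passed."""
    print("Retrieving data")
    # Create the target folder (and any parents) if it does not exist yet.
    Path(down_direct).mkdir(parents=True, exist_ok=True)
    saved = []
    for year, qtr, target in index_targets(start_year, end_year, down_direct):
        url = f"{base_url.rstrip('/')}/{year}/QTR{qtr}/master.idx"
        fetch(url, target)
        # Reported inside the loop, so every file gets its own line.
        print("Downloaded", target)
        saved.append(target)
        time.sleep(pause_seconds)  # space out requests to be polite to the server
    print("Data retrieved")
    return saved
```

Passing the downloader in as `fetch` keeps the loop independent of whether `urllib` or `requests` does the actual transfer, and it makes the loop easy to try out with a stand-in function before touching the network.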
