I am looking at getting the Balance Sheet data automatically and properly organized for any company using Beautiful Soup.
I am not planning on getting each variable but rather the whole Balance sheet. Originally, I was trying to do many codes to extract the URL for a particular company of my choice.
For Example, suppose I want to get the Balance Sheet data from the following URL:
URL1:'https://www.sec.gov/Archives/edgar/data/1418121/000118518520000213/aple20191231_10k.htm'
or from
URL2:'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000046/form8-k03312020earnings.htm'
I am trying to write a function (suppose it is known as get_balancesheet(URL)
) such that regardless of the URL you will get the Dataframe that contains the balance sheet in an organized manner.
# Import libraries
import requests
import re
from bs4 import BeautifulSoup
I wrote the following function that needs a lot of improvement
def Get_Data_Balance_Sheet(url):
page = requests.get(url)
# Create a BeautifulSoup object
soup = BeautifulSoup(page.content)
futures1 = soup.find_all(text=re.compile('CONSOLIDATED BALANCE SHEETS'))
Table=[]
for future in futures1:
for row in future.find_next("table").find_all("tr"):
t1=[cell.get_text(strip=True) for cell in row.find_all("td")]
Table.append(t1)
# Remove list from list of lists if list is empty
Table = [x for x in Table if x != []]
return Table
Then I execute the following
url='https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/fb-12312019x10k.htm'
Tab=Get_Data_Balance_Sheet(url)
Tab
Note that this is not what I am planning for to have It is not simply putting it in a dataframe but we need to change it such that regardless of which URL we can get the Balance Sheet.
Well, this being EDGAR it's not going to be simple, but it's doable.
First things first - with the CIK you can extract specific filings of specific types made the CIK filer during a spacific period. So let say you are interested in Forms 10-K and 10-Q, original or amended (as in "FORM 10-K/A", for example), filed by this CIK filer from 2019 through 2020.
start = 2019
end = 2020
cik = 220000320193
short_cik = str(cik)[-6:] #we will need it later to form urls
First we need to get a list of filings meeting these criteria and load it into beautifulsoup:
import requests
from bs4 import BeautifulSoup as bs
url = f"https://www.sec.gov/cgi-bin/srch-edgar?text=cik%3D%{cik}%22+AND+form-type%3D(10-q*+OR+10-k*)&first={start}&last={end}"
req = requests.get(url)
soup = bs(req.text,'lxml')
There are 8 filings meeting the criteria: two Form 10-K and 6 Form 10-Q. Each of these filings has an accession number. The accession number is hiding in the url of each of these filings and we need to extract it to get to the actual target - the Excel file which contains the financial statements which are attached to each specific filing.
acc_nums = []
for link in soup.select('td>a[href]'):
target = link['href'].split(short_cik,1)
if len(target)>1:
acc_num = target[1].split('/')[1]
if not acc_num in acc_nums: #we need this filter because each filing has two forms: text and html, with the same accession number
acc_nums.append(acc_num)
At this point, acc_nums
contains the accession number for each of these 8 filings. We can now download the target Excel file. Obviusly, you can loop through acc_num
and download all 8, but let's say you are only looking for (randomly) the Excel file attached to the third filing:
fs_url = f"https://www.sec.gov/Archives/edgar/data/{short_cik}/{acc_nums[2]}/Financial_Report.xlsx"
fs = requests.get(fs_url)
with open('random_edgar.xlsx', 'wb') as output:
output.write(fs.content)
And there you'll have more than you'll ever want to know about Apple's financials at that point in time...