Search code examples
pythonpandasbeautifulsoupyahoo-finance

Web Scraping Yahoo Finance Python


I am trying to gather yahoo finance data for a given ticker symbol in a Dataframe from the income statement, balance sheet, and cash flow reports.(URL's provided below)

I used this function from Balance sheet from using yfinance does not have 'Total Debt' like on Yahoo Finance but it only worked for ticker "AAPL" and nothing else.

                                          2022-09-30    2021-09-30    2020-09-30    2019-09-30
Total Assets                             352755000000  351002000000  323888000000  338516000000
Total Liabilities Net Minority Interest  302083000000  287912000000  258549000000  248028000000
Total Equity Gross Minority Interest      50672000000   63090000000   65339000000   90488000000
Total Capitalization                     149631000000  172196000000  164006000000  182295000000
Common Stock Equity                       50672000000   63090000000   65339000000   90488000000
Net Tangible Assets                       50672000000   63090000000   65339000000   90488000000
Working Capital                          -18577000000    9355000000   38321000000   57101000000
Invested Capital                         170741000000  187809000000  177775000000  198535000000
Tangible Book Value                       50672000000   63090000000   65339000000   90488000000
Total Debt                               120069000000  124719000000  112436000000  108047000000
Net Debt                                  96423000000   89779000000   74420000000   59203000000
Share Issued                              15943425000   16426786000   16976763000   17772944000
Ordinary Shares Number                    15943425000   16426786000   16976763000   17772944000

I would like a more robust web scraper that will work for any ticker and be able to get all 3 of those reports without much modification.

I plan to have a separate functions for each one

import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup

def retrieve_balance_sheet(ticker):
    ticker = ticker.upper()

    url = f"https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}"
    header = {'Connection': 'keep-alive',
                'Expires': '-1',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) \
                AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
                }
        
    r = requests.get(url, headers=header)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")

    div = soup.find_all('div', attrs={'class': 'D(tbhg)'})
    if len(div) < 1:
        print("Fail to retrieve table column header")
        exit(0)

    col = []
    for h in div[0].find_all('span'):
        text = h.get_text()
        if text != "Breakdown":
            col.append( datetime.strptime(text, "%m/%d/%Y") )
    
    df = pd.DataFrame(columns=col)
    for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
        i = 0
        idx = ""
        val = []
        for h in div.find_all('span') :
            if i == 0:
                idx = h.get_text()
            else:
                num = int(h.get_text().replace(",", "")) * 1000
                val.append( num )
            i += 1
        row = pd.DataFrame([val], columns=col, index=[idx] )
        df = pd.concat([df, row])

    return df

Solution

  • To get the tables from the 3 URLs you can try:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    urls = [
        'https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}',
        'https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}',
        'https://finance.yahoo.com/quote/{ticker}/cash-flow?p={ticker}'
    ]
    
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}
    
    def get_soup(url):
        r = requests.get(url, headers=headers)
        return BeautifulSoup(r.content, 'html.parser')
    
    ticker = 'AMZN'
    for url in urls:
        soup = get_soup(url.format(ticker=ticker))
    
        table = soup.select_one('.BdT')
        all_data = []
        for row in table.select('.D\(tbr\)'):
            data = [cell.text for cell in row.select('.Ta\(c\), .Ta\(start\)')]
            all_data.append(data)
    
        df = pd.DataFrame(all_data[1:], columns=all_data[0])
        print(df)
        print()
    

    Prints:

                                                         Breakdown          ttm   12/31/2022   12/31/2021   12/31/2020   12/31/2019
    0                                                Total Revenue  513,983,000  513,983,000  469,822,000  386,064,000  280,522,000
    1                                              Cost of Revenue  446,343,000  446,343,000  403,507,000  334,564,000  241,699,000
    2                                                 Gross Profit   67,640,000   67,640,000   66,315,000   51,500,000   38,823,000
    3                                            Operating Expense   55,392,000   55,392,000   41,436,000   28,601,000   24,282,000
    4                                             Operating Income   12,248,000   12,248,000   24,879,000   22,899,000   14,541,000
    5                    Net Non Operating Interest Income Expense   -1,378,000   -1,378,000   -1,361,000   -1,092,000     -768,000
    6                                         Other Income Expense  -16,806,000  -16,806,000   14,633,000    2,371,000      203,000
    7                                                Pretax Income   -5,936,000   -5,936,000   38,151,000   24,178,000   13,976,000
    8                                                Tax Provision   -3,217,000   -3,217,000    4,791,000    2,863,000    2,374,000
    9                     Earnings from Equity Interest Net of Tax       -3,000       -3,000        4,000       16,000      -14,000
    10                              Net Income Common Stockholders   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
    11                    Diluted NI Available to Com Stockholders   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
    12                                                   Basic EPS            -        -0.27         3.30         2.13         1.17
    13                                                 Diluted EPS            -        -0.27         3.24         2.09         1.15
    14                                        Basic Average Shares            -   10,189,000   10,120,000   10,000,000    9,880,000
    15                                      Diluted Average Shares            -   10,189,000   10,300,000   10,200,000   10,080,000
    16                          Total Operating Income as Reported   12,248,000   12,248,000   24,879,000   22,899,000   14,541,000
    17                                              Total Expenses  501,735,000  501,735,000  444,943,000  363,165,000  265,981,000
    18         Net Income from Continuing & Discontinued Operation   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
    19                                           Normalized Income    7,037,600   -2,722,000   33,364,000   21,331,000   11,588,000
    20                                             Interest Income      989,000      989,000      448,000      555,000      832,000
    21                                            Interest Expense    2,367,000    2,367,000    1,809,000    1,647,000    1,600,000
    22                                         Net Interest Income   -1,378,000   -1,378,000   -1,361,000   -1,092,000     -768,000
    23                                                        EBIT   -3,569,000   -3,569,000   39,960,000   25,825,000   15,576,000
    24                                                      EBITDA   38,352,000            -            -            -            -
    25                                  Reconciled Cost of Revenue  446,343,000  446,343,000  403,507,000  334,564,000  241,699,000
    26                                     Reconciled Depreciation   41,921,000   41,921,000   34,296,000   25,251,000   21,789,000
    27  Net Income from Continuing Operation Net Minority Interest   -2,722,000   -2,722,000   33,364,000   21,331,000   11,588,000
    28                      Total Unusual Items Excluding Goodwill  -16,266,000  -16,266,000   14,652,000            -      203,000
    29                                         Total Unusual Items  -16,266,000  -16,266,000   14,652,000            -      203,000
    30                                           Normalized EBITDA   54,618,000   38,352,000   74,256,000   51,076,000   37,365,000
    31                                          Tax Rate for Calcs            0            0            0            0            0
    32                                 Tax Effect of Unusual Items   -6,506,400            0            0            0            0
    
                                      Breakdown   12/31/2022   12/31/2021   12/31/2020   12/31/2019
    0                              Total Assets  462,675,000  420,549,000  321,195,000  225,248,000
    1   Total Liabilities Net Minority Interest  316,632,000  282,304,000  227,791,000  163,188,000
    2      Total Equity Gross Minority Interest  146,043,000  138,245,000   93,404,000   62,060,000
    3                      Total Capitalization  213,193,000  186,989,000  125,220,000   85,474,000
    4                       Common Stock Equity  146,043,000  138,245,000   93,404,000   62,060,000
    5                 Capital Lease Obligations   72,968,000   67,651,000   52,573,000   39,791,000
    6                       Net Tangible Assets  125,755,000  122,874,000   78,387,000   47,306,000
    7                           Working Capital   -8,602,000   19,314,000    6,348,000    8,522,000
    8                          Invested Capital  213,193,000  186,989,000  125,220,000   85,474,000
    9                       Tangible Book Value  125,755,000  122,874,000   78,387,000   47,306,000
    10                               Total Debt  140,118,000  116,395,000   84,389,000   63,205,000
    11                                 Net Debt   13,262,000   12,524,000            -            -
    12                             Share Issued   10,757,000   10,640,000   10,540,000   10,420,000
    13                   Ordinary Shares Number   10,242,000   10,180,000   10,060,000    9,960,000
    14                   Treasury Shares Number      515,000      460,000      480,000      460,000
    
                                Breakdown          ttm   12/31/2022   12/31/2021   12/31/2020   12/31/2019
    0                 Operating Cash Flow   46,752,000   46,752,000   46,327,000   66,064,000   38,514,000
    1                 Investing Cash Flow  -37,601,000  -37,601,000  -58,154,000  -59,611,000  -24,281,000
    2                 Financing Cash Flow    9,718,000    9,718,000    6,291,000   -1,104,000  -10,066,000
    3                   End Cash Position   54,253,000   54,253,000   36,477,000   42,377,000   36,410,000
    4   Income Tax Paid Supplemental Data    6,035,000    6,035,000    3,688,000    1,713,000      881,000
    5     Interest Paid Supplemental Data    2,142,000    2,142,000    1,772,000    1,630,000    1,561,000
    6                 Capital Expenditure  -63,645,000  -63,645,000  -61,053,000  -40,140,000  -16,861,000
    7                    Issuance of Debt   62,719,000   62,719,000   26,959,000   17,321,000    2,273,000
    8                   Repayment of Debt  -47,001,000  -46,753,000  -20,668,000  -18,425,000  -12,339,000
    9         Repurchase of Capital Stock   -6,000,000   -6,000,000            -            -            -
    10                     Free Cash Flow  -16,893,000  -16,893,000  -14,726,000   25,924,000   21,653,000