python, yahoo-finance

Scraping historical data from Yahoo Finance with Python


As some of you probably know by now, Yahoo! Finance appears to have discontinued its API for stock market data. I am aware of the fix-yahoo-finance workaround, but I wanted something more stable in my own code, so I am trying to scrape the historical data directly from Yahoo.

So here is what I have for the moment:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://finance.yahoo.com/quote/AAPL/history?period1=345423600&period2=1495922400&interval=1d&filter=history&frequency=1d")
soup = BeautifulSoup(page.content, 'html.parser')
# Inspect the parsed HTML
print(soup.prettify())

To get the data from the Yahoo table, I can do:

c = soup.find_all('tbody')
print(c)

My question is: how do I turn "c" into a clean pandas DataFrame? Thanks!


Solution

  • I wrote this to fetch historical data from Yahoo Finance directly through the CSV download link. It makes two requests: the first gets the cookie and the crumb, and the second gets the data. It returns a pandas DataFrame.

    import re
    from io import StringIO
    from datetime import datetime, timedelta
    
    import requests
    import pandas as pd
    
    
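    class YahooFinanceHistory:
        """Download daily historical quotes from Yahoo Finance as a DataFrame."""

        timeout = 2  # request timeout in seconds
        # History page; the crumb token is read from its source
        crumb_link = 'https://finance.yahoo.com/quote/{0}/history?p={0}'
        crumble_regex = r'CrumbStore":{"crumb":"(.*?)"}'
        # CSV download endpoint; period1 and period2 are Unix timestamps
        quote_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={dfrom}&period2={dto}&interval=1d&events=history&crumb={crumb}'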
    
        def __init__(self, symbol, days_back=7):
            self.symbol = symbol
            self.session = requests.Session()
            self.dt = timedelta(days=days_back)
    
        def get_crumb(self):
            """Load the history page to set the session cookie and extract the crumb."""
            response = self.session.get(self.crumb_link.format(self.symbol), timeout=self.timeout)
            response.raise_for_status()
            match = re.search(self.crumble_regex, response.text)
            if not match:
                raise ValueError('Could not get crumb from Yahoo Finance')
            self.crumb = match.group(1)
    
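        def get_quote(self):
            """Return the last `days_back` days of daily quotes as a pandas DataFrame."""
            # Fetch the crumb (and cookie) on first use, or if the session has no cookies
            if not hasattr(self, 'crumb') or len(self.session.cookies) == 0:
                self.get_crumb()
            # .timestamp() treats a naive datetime as local time, so use now()
            # rather than utcnow() to avoid shifting the window by the UTC offset
            now = datetime.now()
            dateto = int(now.timestamp())
            datefrom = int((now - self.dt).timestamp())
            url = self.quote_link.format(quote=self.symbol, dfrom=datefrom, dto=dateto, crumb=self.crumb)
            response = self.session.get(url, timeout=self.timeout)
            response.raise_for_status()
            return pd.read_csv(StringIO(response.text), parse_dates=['Date'])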
    

    You can use it like this:

    df = YahooFinanceHistory('AAPL', days_back=30).get_quote()
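
If you would rather keep the scraping approach from the question, here is a rough sketch of how the `tbody` rows could be turned into a DataFrame. It is not part of the class above. The helper name `tbody_to_dataframe`, the column names, the `"May 26, 2017"` date format, and the sample HTML (with made-up numbers) are all assumptions for illustration, so the real page markup may need adjustments. Rows that do not have exactly seven cells, such as dividend notices, are skipped.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Assumed column layout of the history table (not taken from the live page)
COLUMNS = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]

# Hypothetical markup with made-up numbers, standing in for the scraped page
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>May 26, 2017</td><td>100.00</td><td>101.50</td><td>99.50</td>
        <td>101.00</td><td>101.00</td><td>1,234,567</td></tr>
    <tr><td>May 25, 2017</td><td colspan="6">0.50 Dividend</td></tr>
    <tr><td>May 24, 2017</td><td>99.00</td><td>100.25</td><td>98.75</td>
        <td>100.00</td><td>100.00</td><td>2,345,678</td></tr>
  </tbody>
</table>
"""


def tbody_to_dataframe(soup, columns=COLUMNS):
    """Collect the text of every <td> in every <tbody> row into a DataFrame."""
    rows = []
    for tbody in soup.find_all("tbody"):
        for tr in tbody.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            # Keep only full price rows; dividend or split rows have fewer cells
            if len(cells) == len(columns):
                rows.append(cells)

    df = pd.DataFrame(rows, columns=columns)
    # Assumed date format, e.g. "May 26, 2017"
    df["Date"] = pd.to_datetime(df["Date"], format="%b %d, %Y")
    for col in columns[1:]:
        # Drop thousands separators, then convert; unparseable cells become NaN
        df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False), errors="coerce")
    return df


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = tbody_to_dataframe(soup)
print(df)
```

With the `soup` object from the question you would call `tbody_to_dataframe(soup)` in the same way. Keep in mind that the HTML returned by a plain `requests.get` may not contain every row shown in the browser, which is one reason the CSV download above is usually the easier route.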