Scraping historical data from Yahoo Finance with Python

As some of you probably know by now, Yahoo! Finance appears to have discontinued its API for stock market data. I am aware of the fix-yahoo-finance workaround, but I would like a more stable solution in my code that scrapes the historical data directly from Yahoo.

So here is what I have for the moment:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://finance.yahoo.com/quote/AAPL/history?period1=345423600&period2=1495922400&interval=1d&filter=history&frequency=1d")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

To get the data from the Yahoo table, I can do:

c = soup.find_all('tbody')
print(c)

My question is: how do I turn "c" into a tidy pandas DataFrame? Thanks!

Solution

I wrote this to get historical data from YF directly from the CSV download link. It needs to make two requests: one to get the cookie and the crumb, and another to get the data. It returns a pandas DataFrame.

import re
from io import StringIO
from datetime import datetime, timedelta

import requests
import pandas as pd


class YahooFinanceHistory:
    timeout = 2
    crumb_link = 'https://finance.yahoo.com/quote/{0}/history?p={0}'
    crumble_regex = r'CrumbStore":{"crumb":"(.*?)"}'
    quote_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={dfrom}&period2={dto}&interval=1d&events=history&crumb={crumb}'

    def __init__(self, symbol, days_back=7):
        self.symbol = symbol
        self.session = requests.Session()
        self.dt = timedelta(days=days_back)

    def get_crumb(self):
        response = self.session.get(self.crumb_link.format(self.symbol), timeout=self.timeout)
        response.raise_for_status()
        match = re.search(self.crumble_regex, response.text)
        if not match:
            raise ValueError('Could not get crumb from Yahoo Finance')
        else:
            self.crumb = match.group(1)

    def get_quote(self):
        if not hasattr(self, 'crumb') or len(self.session.cookies) == 0:
            self.get_crumb()
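        # Use local time here: .timestamp() treats a naive datetime as local time,
        # so a naive utcnow() value would be shifted by the UTC offset.
        now = datetime.now()
        dateto = int(now.timestamp())
        datefrom = int((now - self.dt).timestamp())
        url = self.quote_link.format(quote=self.symbol, dfrom=datefrom, dto=dateto, crumb=self.crumb)
        response = self.session.get(url, timeout=self.timeout)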
        response.raise_for_status()
        return pd.read_csv(StringIO(response.text), parse_dates=['Date'])
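
To make the crumb step a little more concrete, here is a small sketch of what the regex in get_crumb is doing. The page fragment and the crumb value are made up for illustration; the real page embeds a much larger blob, and its layout may change.

```python
import re

# Same pattern as crumble_regex in the class above.
CRUMB_REGEX = r'CrumbStore":{"crumb":"(.*?)"}'

# Hypothetical page fragment, not real Yahoo output.
sample_page = '... "CrumbStore":{"crumb":"AbC123xyz"},"other":{} ...'

# The non-greedy group captures everything up to the first closing "}.
match = re.search(CRUMB_REGEX, sample_page)
crumb = match.group(1) if match else None
print(crumb)
```

If the pattern stops matching, get_crumb raises the ValueError, which is usually the first sign that the page markup has changed.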

You can use it like this:

df = YahooFinanceHistory('AAPL', days_back=30).get_quote()
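
As for the literal question of turning the scraped tbody into a DataFrame, a minimal sketch could look like the following. The HTML, the column names, and the helper name table_to_frame are all made up for illustration; Yahoo's actual markup may differ, and the numbers are placeholders rather than real prices.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page.
sample_html = """
<table>
  <thead><tr><th>Date</th><th>Open</th><th>Close</th></tr></thead>
  <tbody>
    <tr><td>May 26, 2017</td><td>100.0</td><td>101.0</td></tr>
    <tr><td>May 25, 2017</td><td>101.5</td><td>100.5</td></tr>
    <tr><td colspan="3">Some notice row</td></tr>
  </tbody>
</table>
"""


def table_to_frame(html):
    """Parse the first HTML table in `html` into a typed DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    columns = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Keep only rows with one cell per column (skips spanning notice rows).
        if len(cells) == len(columns):
            rows.append(cells)
    frame = pd.DataFrame(rows, columns=columns)
    frame["Date"] = pd.to_datetime(frame["Date"], format="%b %d, %Y")
    numeric = [name for name in columns if name != "Date"]
    frame[numeric] = frame[numeric].apply(pd.to_numeric)
    return frame


df = table_to_frame(sample_html)
print(df)
```

Values with thousands separators (such as a volume written as 1,234,567) would need the commas stripped before pd.to_numeric. Scraping the table this way depends on the page layout, which is why the CSV approach above is probably the sturdier of the two.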