Tags: pandas, google-finance, quandl, pandas-datareader

Pandas DataReader work-around for Google finance


I pull historical data for a large universe of stocks and ETFs daily. Quandl has pretty good free coverage of US Equities, but they do not have historical data for ETFs so I use the Google API as a backup for Quandl.

The recent Google Finance "renovation" hasn't left me with a great alternative, so I am trying to apply Brad Solomon's work (thanks Brad, link below) to a list of symbols. I assume this won't work without a loop, given that he builds one URL per symbol. Any clever ideas welcome.

Related question: How come pandas_datareader for google doesn't work?

Thanks.


Solution

  • Under the hood, pandas-datareader loops through each symbol that you pass and makes HTTP requests one by one.

    Here's the function that does that in the base class, from which the google- and yahoo-related classes inherit: base._DailyBaseReader._dl_mult_symbols.
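    A simplified sketch of that loop (the names `fetch_one` and `dl_mult_symbols` here are illustrative stand-ins, not the package's actual internals beyond the `_dl_mult_symbols` idea itself):

```python
import pandas as pd

def fetch_one(sym):
    # Stand-in for a single-symbol reader; the real one makes an
    # HTTP request and parses the CSV response into a DataFrame.
    return pd.DataFrame({'Close': [1.0, 2.0]},
                        index=pd.to_datetime(['2010-01-04', '2010-01-05']))

def dl_mult_symbols(symbols):
    # Loop symbol by symbol, collecting successes and failures,
    # much as base._DailyBaseReader._dl_mult_symbols does.
    passed, failed = {}, []
    for sym in symbols:
        try:
            passed[sym] = fetch_one(sym)
        except Exception:
            failed.append(sym)
    return passed, failed

passed, failed = dl_mult_symbols(['AAPL', 'GE'])
```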

    The magic is that these are appended to a list and then aggregated into a pandas Panel.

    I would note, however, that Panel is deprecated and you can get the same functionality in a DataFrame with a MultiIndex, a structure that's technically 2-dimensional but replicates higher dimensionality in practice.
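    For instance, a dict of per-symbol frames can be glued into one MultiIndex-column DataFrame with `pd.concat` (a minimal sketch with made-up prices):

```python
import pandas as pd

idx = pd.to_datetime(['2010-01-04', '2010-01-05'])
frames = {
    'AAPL': pd.DataFrame({'Open': [30.5, 30.6], 'Close': [30.57, 30.63]}, index=idx),
    'GE':   pd.DataFrame({'Open': [15.4, 15.5], 'Close': [15.45, 15.53]}, index=idx),
}

# Outer column level = symbol, inner level = field: one 2-D
# structure standing in for what a 3-D Panel used to hold.
data = pd.concat(frames, axis=1)
close = data.xs('Close', axis=1, level=1)
```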

    So, here's the barebones of what you could do, below. Please note I'm skipping a lot of the functionality embedded within the package itself, such as parsing string dates to datetime.

    import datetime
    from io import StringIO
    
    import requests
    from urllib.parse import urlencode
    import pandas as pd
    
    BASE = 'http://finance.google.com/finance/historical'
    
    
    def get_params(sym, start, end):
        params = {
            'q': sym,
            'startdate': start.strftime('%Y/%m/%d'),
            'enddate': end.strftime('%Y/%m/%d'),
            'output': "csv"
        }
        return params
    
    
    def build_url(sym, start, end):
        params = get_params(sym, start, end)
        return BASE + '?' + urlencode(params)
    
    
    def get_one_data(sym, start=None, end=None):
        if not start:
            start = datetime.datetime(2010, 1, 1)
        if not end:
            end = datetime.datetime.today()
        url = build_url(sym, start, end)
        data = requests.get(url).text
        return pd.read_csv(StringIO(data), index_col='Date',
                           parse_dates=True).sort_index()
    
    
    def get_multiple(sym, start=None, end=None, return_type='Panel'):
        if isinstance(sym, str):
            return get_one_data(sym, start=start, end=end)
        elif isinstance(sym, (list, tuple, set)):
            res = {}
            for s in sym:
                res[s] = get_one_data(s, start, end)
        # The actual module also implements a 'passed' and 'failed'
        #     check here and also uses chunking to get around
        #     data retrieval limits (I believe)
    
        if return_type.lower() == 'panel':
            return pd.Panel(res).swapaxes('items', 'minor')
        elif return_type.lower() == 'mi':  # MultiIndex DataFrame
            return pd.concat(res, axis=1)
    

    An example:

    syms = ['AAPL', 'GE']
    data = get_multiple(syms, return_type='mi')
    
    # Here's how you would filter down to Close prices
    #   on MultiIndex columns
    data.xs('Close', axis=1, level=1) 
    
                  AAPL     GE
    Date                     
    2010-01-04   30.57  15.45
    2010-01-05   30.63  15.53
    2010-01-06   30.14  15.45
    2010-01-07   30.08  16.25
    2010-01-08   30.28  16.60
    ...
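    One of the conveniences skipped above, parsing string dates, is easy to bolt on with `pd.to_datetime`. A hedged sketch; `parse_date` is a hypothetical helper you would call at the top of `get_one_data`:

```python
import datetime
import pandas as pd

def parse_date(value, default):
    # Accept None, a datetime, or a string like '2010-01-01';
    # pd.to_datetime handles the common formats.
    if value is None:
        return default
    return pd.to_datetime(value).to_pydatetime()

start = parse_date('2010-01-01', datetime.datetime(2010, 1, 1))
end = parse_date(None, datetime.datetime.today())
```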