python, pandas, yahoo-finance

How to get ALL historical data from Yahoo without specifying a start date in pandas?


I am learning Python and pandas for data analysis, and I am trying to program some investment ideas as exercises. pandas has a nice io.data module for pulling data from online sources such as Yahoo and Google. However, these readers all work from a start date, which defaults to 2010-01-01 when none is given, as the following code in data.py shows:

http://github.com/pydata/pandas/blob/master/pandas/io/data.py:

def _sanitize_dates(start, end):
    from pandas.core.datetools import to_datetime
    start = to_datetime(start)
    end = to_datetime(end)
    if start is None:
        start = dt.datetime(2010, 1, 1)
    if end is None:
        end = dt.datetime.today()
    return start, end

Since every stock went public on a different date, it is impractical to specify a start date for each ticker. Wouldn't it be nice if there were an option to make pandas read ALL available data? Even for a company that has been public for 50 years, the data is only about 50 * 200 = 10,000 rows. Python should be able to handle that, right?

Thank you for your help. And my salute to Wes and other pandas contributors; pandas is great!


Solution

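  • A simple solution is to pass a common start date that is early enough that no data exists before it. 1 January 1970 seems like a fair choice. As the output below shows, the returned data simply begins at the first date for which data is available.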

    In [55]: from pandas.io.data import DataReader
    In [56]: from datetime import datetime
    In [57]: df_1 = DataReader("AAPL", "yahoo", datetime(1970, 1, 1), datetime(2013, 10, 1))
    In [58]: df_1
    Out[58]: 
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 7330 entries, 1984-09-07 00:00:00 to 2013-10-01 00:00:00
    Data columns (total 6 columns):
    Open         7330  non-null values
    High         7330  non-null values
    Low          7330  non-null values
    Close        7330  non-null values
    Volume       7330  non-null values
    Adj Close    7330  non-null values
    dtypes: float64(5), int64(1)
    

    Now choose 1984-09-07, the first date in the index above, as the start date, and observe that the same data is pulled, giving a DataFrame with the same 7330 entries.

    In [59]: df_2 = DataReader("AAPL", "yahoo", datetime(1984, 9, 7), datetime(2013, 10, 1))
    In [60]: df_2
    Out[60]: 
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 7330 entries, 1984-09-07 00:00:00 to 2013-10-01 00:00:00
    Data columns (total 6 columns):
    Open         7330  non-null values
    High         7330  non-null values
    Low          7330  non-null values
    Close        7330  non-null values
    Volume       7330  non-null values
    Adj Close    7330  non-null values
    dtypes: float64(5), int64(1)
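
To avoid repeating the early start date for every ticker, the idea can be wrapped in a small helper. This is only a minimal sketch using the standard library: `resolve_dates` and `EARLIEST_START` are hypothetical names, not part of pandas, and the choice of 1970-01-01 rests on the assumption that none of your tickers has data before that date.

```python
from datetime import datetime

# Assumption: no ticker of interest has data before this date.
EARLIEST_START = datetime(1970, 1, 1)


def resolve_dates(start=None, end=None, earliest=EARLIEST_START):
    """Return (start, end), defaulting to an early start date and to today."""
    if start is None:
        start = earliest
    if end is None:
        end = datetime.today()
    if start > end:
        raise ValueError("start must not be later than end")
    return start, end


# Only the end date is given, so the start falls back to EARLIEST_START.
start, end = resolve_dates(end=datetime(2013, 10, 1))
print(start.date(), end.date())  # prints: 1970-01-01 2013-10-01
```

The resulting pair can then be passed as the start and end arguments of DataReader, exactly as in the calls above.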