Search code examples
pythoncsvpandasdataframepandas-datareader

Pandas .join not working to combine S&P500 stock data


I'm following the finance tutorials on PythonProgramming.net and have run into an issue when I try to combine several dataframes into one large dataframe. I created a function to do this:

def compile_data():
    with open ("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count,ticker in enumerate(tickers):
        try:
            df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
            df.set_index('Date', inplace=True)
            df.rename(columns={'Close':ticker}, inplace=True)
            df.drop(['Open','High','Low','Volume'], 1, inplace=True)
            if main_df.empty:
                main_df = df
            else:
                main_df.join(df, how='outer')
                print(main_df.head())
            if count % 10 == 0:
                print(count)
        except Exception:
            pass

    print(main_df.head())
    main_df.to_csv('sp500joinedcloses.csv')

(The reason I used the try/except in the above code was because I have a list of all the tickers for the S&P500, but wasn't able to grab data from Google Finance API for all of them... so this way, if it tries to find a csv that I don't have, it will still combine the ones I do have without throwing an error.)

When I run this function, it creates a CSV called sp500joinedcloses.csv, but it only contains the data for one ticker, namely ABBV. I know it is iterating through the tickers properly, because if I add a print(ticker) in the for loop, all the correct tickers are printed.

It's also worth noting that ABBV isn't the first csv that I have that should be included in the dataframe. They first ticker that should have a file is AAPL, and then ABBV. No idea why it seems to skip AAPL.

I would appreciate any help. I am a beginner to pandas, and really want to learn everything I can about it.


Solution

  • IIUC:

    You don't want to use join and you couldn't anyway if you start with an empty dataframe. Use pd.concat instead:

    main_df = pd.concat([main_df, df], axis=1)
    

    However, I would recommend this to replace your whole process:

    def read_file(ticker):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker)).set_index('Date')
        return df.Close.rename(ticker)
    
    with open ("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)
    
    main_df = pd.concat([read_file(t) for t in tickers], axis=1)