Search code examples
pythonpandasloopsreport

How to read in multiple files as separate dataframes and perform calculations on a column?


I am calculating a single stock return as follow:

data = pd.read_csv(r'**file**.csv')
data.index = data.Date
data['Return %'] = data['AAPL'].pct_change(-1)*100
data

out put:

    Date    AAPL    Return %
Data            
2020-09-11  2020-09-11  56.00   0.000000
2020-09-10  2020-09-10  56.00   -3.879162
2020-09-09  2020-09-09  58.26   2.138850
2020-09-08  2020-09-08  57.04   -2.211555
2020-09-04  2020-09-04  58.33   0.882048
2020-09-03  2020-09-03  57.82   -3.585126
2020-09-02  2020-09-02  59.97   -0.133222

Now, I have many other csv files as stock symbols saved and I would like to use each of these symbols to perform same calculation above. On top of that, I would like to print a report for the best day of each of these symbols returns.

If more details needed, pls let me know.

Thanks in advance!


Solution

    • I think the best option for your data is to read the files into a dictionary of dataframes.
      • Use pathlib and .glob to create a list of all the files
      • Use a dict comprehension to create the dict of dataframes.
    • The dictionary can be iterated over in the standard way of dictionaries, with dict.items().
    • df_dict[k] addresses each dataframe, where k is the dict key, which is the file name.
    • From your last question, I expect the .csv file to be read in with one Date column, not two.
    • The numeric data for each file should be in the column at index 0, after Date is set as the index.
      • Since the column name is different for each file, it's better to use .iloc to address the column.
      • : means all rows and 0 is the column index for the numeric data.
    • df_dict.keys() will return a list of all the keys
    • Individually access a dataframe with df_dict[key].
    import pandas as pd
    from pathlib import Path
    
    # create the path to the files
    p = Path('c:/Users/<<user_name>>/Documents/stock_files')
    
    # get all the files
    files = p.glob('*.csv')
    
    # created the dict of dataframes
    df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}
    
    # apply calculations to each dataframe and update the dataframe
    # since the stock data is in column 0 of each dataframe, use .iloc
    for k, df in df_dict.items():
        df_dict[k]['Return %'] = df.iloc[:, 0].pct_change(-1)*100