Tags: python, pandas, csv, indexing, io

Reading CSV files from a directory and storing the data in a multi-index Pandas DataFrame


I have CSV files containing fundamental data stored in a folder. The folder looks like this:

[screenshot of the folder contents]

Once imported as a Pandas DataFrame, a typical file looks like this:

[screenshot of a typical imported file]

I would like to write code that loops over the files in the directory and does the following:

1) Set the column 'Quarter end' as the index.

2) Convert that column to a datetime index.

3) Convert the DataFrame to a multi-indexed DataFrame with a top-level column index (axis=1) named 'Company', whose value is the name of the company for that particular file. That way all the columns holding a company's fundamental data sit under that company's name, which matters in the next step, where the files are merged into one.

4) Merge the files into one using the index. A key observation is that the indexes do not always match. For example, for the first five files the first values of the index (set to 'Quarter end') are the following (a toy sketch of how pandas aligns such mismatched indexes follows these examples):

Index(['2018-01-31', '2017-10-31', '2017-07-31', '2017-04-30', ...
Index(['2018-03-31', '2017-12-31', '2017-09-30', '2017-06-30', '2017-03-31', ...
Index(['2018-03-31', '2017-12-31', '2017-09-30', '2017-06-30', '2017-03-31', ...
Index(['2017-12-30', '2017-10-07', '2017-07-15', '2017-04-22', '2016-12-31', ...
Index(['2018-03-31', '2017-12-30', '2017-09-30', '2017-07-01', '2017-04-01', ...
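
To illustrate the alignment behaviour: pd.concat along axis=1 does an outer join on the index by default, so the merged frame spans the union of all quarter-end dates and shows NaN where a company has no matching row. A toy sketch with made-up values:

import pandas as pd

# two toy frames with partially overlapping quarter-end dates (values are made up)
a = pd.DataFrame({'Revenue': [10, 9]},
                 index=pd.to_datetime(['2018-01-31', '2017-10-31']))
b = pd.DataFrame({'Revenue': [20, 19]},
                 index=pd.to_datetime(['2018-03-31', '2017-12-31']))

# the dict keys become the top column level; the index is the union of the dates
merged = pd.concat({'A': a, 'B': b}, axis=1)
print(merged)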

My approach up to now is the following:

1) Amass the filenames in a list:

import os
os.chdir('D:\\DIGITAL_LIBRARY\\Corporate_fundamental_data\\Version_8_2018')
filenames = os.listdir()
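
If the folder ever contains anything besides the CSV files, glob can restrict the listing; a small sketch assuming the same path:

import glob
import os

DATA_DIR = 'D:\\DIGITAL_LIBRARY\\Corporate_fundamental_data\\Version_8_2018'
# full paths to the CSV files only, so no os.chdir is needed
filenames = sorted(glob.glob(os.path.join(DATA_DIR, '*.csv')))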

2) Loop over the list and read the files

for file in filenames:
    # 'file' is already the filename string, so pass it straight to read_csv;
    # the first (unnamed) column is dropped with .iloc[:, 1:]
    df = pd.read_csv(file, index_col=False).iloc[:, 1:]

3) Set the column 'Quarter end' as the index

df.set_index('Quarter end', inplace = True)

4) Convert the index to a datetime object

df.index = pd.to_datetime(df.index)

Now I would like to convert it to a multi-indexed object, as described at the beginning, and give the top-level index the appropriate value: the name of the company, extracted from the loop variable (e.g. 'A' from 'A.csv', which is easy).
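
A minimal sketch of that step, written as a hypothetical helper (the inner column level is left unnamed):

import os
import pandas as pd

def add_company_level(df, filename):
    # 'A.csv' -> 'A' (take the basename first, in case a full path is passed)
    company = os.path.splitext(os.path.basename(filename))[0]
    # put the company name above the existing columns as the 'Company' level
    df.columns = pd.MultiIndex.from_product([[company], df.columns],
                                            names=['Company', None])
    return df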

  1. Then I will store them in a dictionary, one at a time, giving each one a key that is the name of the company.

  2. Then I will loop over the dictionary and merge the DataFrames using their indexes.

  3. Finally, I will forward-fill the gaps created by the mismatches between the indexes. (A sketch of these three steps follows this list.)
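
A minimal sketch of those three steps, assuming filenames holds the CSV names from step 1 and every file has a 'Quarter end' column as in the screenshot (variable names are just for illustration):

import os
import pandas as pd

frames = {}
for file in filenames:
    df = pd.read_csv(file, index_col=False).iloc[:, 1:]
    df.set_index('Quarter end', inplace=True)
    df.index = pd.to_datetime(df.index)
    frames[os.path.splitext(os.path.basename(file))[0]] = df  # 'A.csv' -> 'A'

# concat with a dict merges on the union of the indexes and puts the
# dict keys (the company names) on top of each frame's columns
merged = pd.concat(frames, axis=1)
merged.columns = merged.columns.set_names('Company', level=0)

# sort ascending by date before forward-filling the index mismatches
merged = merged.sort_index().ffill()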


Solution

  • How about something like:

    import os
    import pandas as pd

    DATA_DIR = '/path/to/csv_file_dir/'

    all_companies_df = pd.DataFrame()
    for _company_fn in os.listdir(DATA_DIR):
        _company_df = pd.read_csv(os.path.join(DATA_DIR, _company_fn))
        # index on the quarter-end dates (the question's column is 'Quarter end')
        _company_df.set_index('Quarter end', inplace=True)
        # company name from the file name, e.g. 'A.csv' -> 'A'
        _company_name = os.path.splitext(_company_fn)[0]
        # put the company name above the existing columns as the outer level
        _company_df.columns = pd.MultiIndex.from_product(
            [[_company_name], _company_df.columns.tolist()])

        # axis=1 concat is an outer join on the index, so quarter-end dates
        # missing for a company simply show up as NaN in its columns
        all_companies_df = pd.concat((all_companies_df, _company_df), axis=1)
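
    This covers the reading, the 'Company' column level, and the outer-join merge. The datetime conversion and the forward fill from the original plan can be added after the loop, for example (a sketch using the same all_companies_df):

    # convert the 'Quarter end' index to datetimes, sort it ascending,
    # then forward-fill the gaps left by the mismatched quarter-end dates
    all_companies_df.index = pd.to_datetime(all_companies_df.index)
    all_companies_df = all_companies_df.sort_index().ffill()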