I have csv files stored in a folder which contain fundamental data. The folder looks like this:
Once imported as a Pandas dataframe a typical file looks like this:
I would like to build code that will read through a loop the files from the directory and do the following:
1) Set as Index the column 'Quarter end'.
2) Convert the index to a datetime object.
3) Convert the dataframe to a multi-indexed dataframe with an additional top-level column index (axis=1) named 'Company', whose value is the name of the company in the given file. (All columns of fundamental data for a company will then sit under that company's name. This matters in the final step, where the files are merged into one.)
4) Merge the files into one using the index. A key observation is that the indexes do not always match. For example, for the first five files the first values of the 'Quarter end' index are the following:
Index(['2018-01-31', '2017-10-31', '2017-07-31', '2017-04-30',...
Index(['2018-03-31', '2017-12-31', '2017-09-30', '2017-06-30', '2017-03-31',...
Index(['2018-03-31', '2017-12-31', '2017-09-30', '2017-06-30', '2017-03-31',..
Index(['2017-12-30', '2017-10-07', '2017-07-15', '2017-04-22', '2016-12-31',
Index(['2018-03-31', '2017-12-30', '2017-09-30', '2017-07-01', '2017-04-01',
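This kind of mismatch is exactly what an outer join on the index handles, and `pd.concat` along axis=1 does an outer join by default. A minimal sketch with two made-up frames and dates (not taken from the actual files):

```python
import pandas as pd

# Two toy frames whose DatetimeIndexes do not line up,
# mimicking the mismatched 'Quarter end' dates above.
a = pd.DataFrame(
    {"Revenue": [10, 11]},
    index=pd.to_datetime(["2017-12-31", "2018-03-31"]),
)
b = pd.DataFrame(
    {"Revenue": [20, 21]},
    index=pd.to_datetime(["2017-12-30", "2018-03-31"]),
)

# axis=1 concat outer-joins on the indices: dates missing
# from one frame show up as NaN in that frame's columns.
merged = pd.concat({"A": a, "B": b}, axis=1).sort_index()
print(merged)
```

The dict keys ('A', 'B') become the top column level, which is the same mechanism the 'Company' level will use later.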
My approach up to now is the following:
1) Amass the filenames in a list:
import os
os.chdir('D:\\DIGITAL_LIBRARY\\Corporate_fundamental_data\\Version_8_2018')
filenames = os.listdir()
2) Loop over the list and read the files
for file in filenames:
df = pd.read_csv(file, index_col = False).iloc[:, 1:]
3) Set as index the column 'Quarter end'
df.set_index('Quarter end', inplace = True)
4) Convert to datetime object
df.index = pd.to_datetime(df.index)
Now I would like to convert it to a multi-indexed object, as described at the beginning, and set the value of the new index level to the company name extracted from the file iterator (e.g. 'A' from 'A.csv', which is easy).
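One way to do that step in isolation is `pd.MultiIndex.from_product`. A sketch with a hypothetical single-company frame standing in for one parsed CSV (the column names are made up):

```python
import os
import pandas as pd

# Hypothetical frame standing in for one parsed CSV file.
df = pd.DataFrame(
    {"Revenue": [100, 110], "Assets": [500, 520]},
    index=pd.to_datetime(["2017-12-31", "2018-03-31"]),
)
df.index.name = "Quarter end"

file = "A.csv"  # the loop variable from the snippet above
company = os.path.splitext(file)[0]  # 'A.csv' -> 'A'

# Put the company name on top of the existing columns.
df.columns = pd.MultiIndex.from_product(
    [[company], df.columns], names=["Company", None]
)
```

After this, `df["A"]` returns the original single-level frame for that company.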
Then I will store the dataframes in a dictionary, one at a time, keying each by its company name.
Then I will loop over the dictionary and merge the files on their indices.
Finally, I will forward-fill the gaps created by the index mismatches.
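These last three steps can be sketched in one go, since passing the dictionary straight to `pd.concat` both merges on the indices and uses the keys as the top column level. `frames` below stands in for the dictionary built in the loop, with made-up values and dates:

```python
import pandas as pd

# Stand-in for the dictionary built in the loop: company name -> frame.
frames = {
    "A": pd.DataFrame({"Revenue": [10, 11]},
                      index=pd.to_datetime(["2017-10-31", "2018-01-31"])),
    "B": pd.DataFrame({"Revenue": [20, 21]},
                      index=pd.to_datetime(["2017-12-31", "2018-03-31"])),
}

# A dict passed to concat uses its keys as the new top column level,
# so the per-file MultiIndex step is handled here in one call.
merged = pd.concat(frames, axis=1)
merged.columns = merged.columns.set_names("Company", level=0)

# Oldest quarter first, then forward-fill the gaps from index mismatches.
merged = merged.sort_index().ffill()
```

Note that `ffill` only fills downwards, so any dates before a company's first reported quarter remain NaN.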
How about something like:
import os
import pandas as pd

DATA_DIR = '/path/to/csv_file_dir/'

all_companies_df = pd.DataFrame()
for _company_fn in os.listdir(DATA_DIR):
    _company_df = pd.read_csv(os.path.join(DATA_DIR, _company_fn))
    _company_df.set_index('Quarter end', inplace=True)  # lowercase 'end', as in the files
    _company_df.index = pd.to_datetime(_company_df.index)
    _company_name = os.path.splitext(_company_fn)[0]  # 'A.csv' -> 'A'
    _company_df.columns = pd.MultiIndex.from_product(
        [[_company_name], _company_df.columns], names=['Company', None])
    # axis=1 concat outer-joins on the datetime index
    all_companies_df = pd.concat((all_companies_df, _company_df), axis=1)

# oldest quarter first, then forward-fill the gaps from mismatched indices
all_companies_df = all_companies_df.sort_index().ffill()