I have 12 Excel files. Each is based on a month of the year and therefore each filename ends in 'Month YYYY'. For example, the file for March of 2021 ends in 'March 2021.xlsx'. I want to read each Excel file, select certain columns, drop empty rows, then merge each file into one excel file as a named worksheet. However, I want to search the file's name, identify the month and then rename the second column to say that month.
How do I add a code to have the month of each file be used as the 'new name' for the second column of each df?
Here's an example using two months:
File one: January 2021.xlsx
A | B |
---|---|
1 | x |
3 | x |
File three: February 2021.xlsx
A | B |
---|---|
3 | x |
5 | x |
I want to rename B to represent the month of the respective excel file and then merge to get:
A | January | February |
---|---|---|
1 | x | 0 |
3 | x | x |
5 | 0 | x |
This is what I have done so far.
#Store Excel files in a python list
excel_files = list(Path(DATA_DIR).glob('*.xlsx'))
#Read each file into a dataframe
dfs = []
for excel_file in excel_files:
df = pd.read_excel(excel_file,sheet_name='Sheet1',header=5,usecols='A,F',skipfooter=8)
df.dropna(how='any', axis=0, inplace = True)
df.rename(columns={'old-name': 'new-name'}, inplace=True)
dfs.append(df)
#Compile the list of dataframes to merge
data_frames = [dfs[0], dfs[1],dfs[2] ... dfs[11]]
#Merge all dataframes
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['A'],
how='outer'), data_frames).fillna(0)
I need help adding the code to have the month of each file be used as the 'new name' for the second column of each df?
I think your question is similar to this: Extract month, day and year from date regex
an advanced way to do this would be using regex, which is laid out a little in that prior post.
a simpler way to do this would be to split (or rsplit) the on (' '), assuming that there is a space in front of the month as well as after: excel_file = "first bit of names MONTH 2021.xlsx":
for excel_file in excel_files:
new-name = str(excel_file).rsplit(' ', 2)[-2] # creates a list [[first bit of names...], [MONTH], [2021.xlsx]] and takes the 2nd to last element
df = pd.read_excel(excel_file,sheet_name='Sheet1',header=5,usecols='A,F',skipfooter=8)
df.dropna(how='any', axis=0, inplace = True)
df.rename(columns={'old-name': new-name}, inplace=True)
dfs.append(df)
I think this answers the question, but you may have another problem getting the "old name" in the way you propose. Hope this helps
*edited to match comment