I have a strange request, easy to understand but should be very complicated to realize in my opinion. I have a dataframe composed of several spreadsheet excel (using df concat).
I would like to use all the df names (the excel spreadsheet name) to name all my rows, to know in which spreadsheet is coming from this row. Because thereafter, I will mix those datas.
Thank you for your time/help :)
Use df name as rows name.
Suppose we have an Excel workbook like this:
We can use pd.read_excel
with sheet_name=None
to read this into a dictionary with the sheet names as keys:
import pandas as pd
file_name = 'myfile.xlsx'
dict_df = pd.read_excel(file_name, sheet_name=None)
dict_df
{'Sheet1': Col1 Col2
0 A 1
1 B 2
2 C 3,
'Sheet2': Col1 Col2
0 D 4
1 E 5
2 F 6}
Now, we can use pd.concat
to add the dictionary keys to the index values. As of pd 2.0.0
you can use df.add_prefix
with axis=0
:
# pd >= 2.0.0
df = pd.concat([v.add_prefix(f"{k}_", axis=0) for k, v in dict_df.items()])
df
Col1 Col2
Sheet1_0 A 1
Sheet1_1 B 2
Sheet1_2 C 3
Sheet2_0 D 4
Sheet2_1 E 5
Sheet2_2 F 6
For earlier versions, you can overwrite the index using df.set_index
:
# pd < 2.0.0
df = pd.concat([v.set_index(f"{k}_" + v.index.astype(str))
for k, v in dict_df.items()])
# same result
But it may be nicer to add the sheet names as a separate level, making use of the keys
and names
parameters:
df = pd.concat([v for v in dict_df.values()],
keys=dict_df.keys(),
names=['Sheets', 'Rows'])
df
Col1 Col2
Sheets Rows
Sheet1 0 A 1
1 B 2
2 C 3
Sheet2 0 D 4
1 E 5
2 F 6
Or simply:
df = pd.concat(dict_df, names=['Sheets', 'Rows'])
# same result
E.g., this way you can easily select a particular sheet:
df.loc['Sheet1']
Col1 Col2
Rows
0 A 1
1 B 2
2 C 3
If you are reading from multiple Excel workbooks and also want to include the file names, you can do something like this:
workbooks = ['myfile.xlsx', 'myfile2.xlsx'] # `myfile2` here a copy of `myfile`
nested_dict = {(wb,k): v for wb in workbooks
for k, v in pd.read_excel(wb, sheet_name=None).items()}
df = pd.concat([v for v in nested_dict.values()],
keys=nested_dict.keys(),
names=['Filename', 'Sheets', 'Rows'])
df
Col1 Col2
Filename Sheets Rows
myfile.xlsx Sheet1 0 A 1
1 B 2
2 C 3
Sheet2 0 D 4
1 E 5
2 F 6
myfile2.xlsx Sheet1 0 A 1
1 B 2
2 C 3
Sheet2 0 D 4
1 E 5
2 F 6
Or again:
df = pd.concat(nested_dict, names=['Filename', 'Sheets', 'Rows'])
# same result
Selecting Sheet1 from the first workbook:
df.loc[('myfile.xlsx', 'Sheet1')]
Col1 Col2
Rows
0 A 1
1 B 2
2 C 3