Name all rows with the Index name

I have a strange request, easy to understand but should be very complicated to realize in my opinion. I have a dataframe composed of several spreadsheet excel (using df concat).

I would like to use all the df names (the excel spreadsheet name) to name all my rows, to know in which spreadsheet is coming from this row. Because thereafter, I will mix those datas.

Thank you for your time/help :)

Use df name as rows name.

Solution

Suppose we have an Excel workbook like this:

We can use pd.read_excel with sheet_name=None to read this into a dictionary with the sheet names as keys:

import pandas as pd

file_name = 'myfile.xlsx'

dict_df = pd.read_excel(file_name, sheet_name=None)

dict_df

{'Sheet1':   Col1  Col2
 0    A     1
 1    B     2
 2    C     3,
 'Sheet2':   Col1  Col2
 0    D     4
 1    E     5
 2    F     6}

Now, we can use pd.concat to add the dictionary keys to the index values. As of pd 2.0.0 you can use df.add_prefix with axis=0:

# pd >= 2.0.0
df = pd.concat([v.add_prefix(f"{k}_", axis=0) for k, v in dict_df.items()])

df

         Col1  Col2
Sheet1_0    A     1
Sheet1_1    B     2
Sheet1_2    C     3
Sheet2_0    D     4
Sheet2_1    E     5
Sheet2_2    F     6

For earlier versions, you can overwrite the index using df.set_index:

# pd < 2.0.0
df = pd.concat([v.set_index(f"{k}_" + v.index.astype(str)) 
                for k, v in dict_df.items()])

# same result

But it may be nicer to add the sheet names as a separate level, making use of the keys and names parameters:

df = pd.concat([v for v in dict_df.values()], 
               keys=dict_df.keys(), 
               names=['Sheets', 'Rows'])

df

            Col1  Col2
Sheets Rows           
Sheet1 0       A     1
       1       B     2
       2       C     3
Sheet2 0       D     4
       1       E     5
       2       F     6

Or simply:

df = pd.concat(dict_df, names=['Sheets', 'Rows'])
# same result

E.g., this way you can easily select a particular sheet:

df.loc['Sheet1']

     Col1  Col2
Rows           
0       A     1
1       B     2
2       C     3

If you are reading from multiple Excel workbooks and also want to include the file names, you can do something like this:

workbooks = ['myfile.xlsx', 'myfile2.xlsx'] # `myfile2` here a copy of `myfile`

nested_dict = {(wb,k): v for wb in workbooks 
               for k, v in pd.read_excel(wb, sheet_name=None).items()}

df = pd.concat([v for v in nested_dict.values()], 
               keys=nested_dict.keys(), 
               names=['Filename', 'Sheets', 'Rows'])

df

                         Col1  Col2
Filename     Sheets Rows           
myfile.xlsx  Sheet1 0       A     1
                    1       B     2
                    2       C     3
             Sheet2 0       D     4
                    1       E     5
                    2       F     6
myfile2.xlsx Sheet1 0       A     1
                    1       B     2
                    2       C     3
             Sheet2 0       D     4
                    1       E     5
                    2       F     6

Or again:

df = pd.concat(nested_dict, names=['Filename', 'Sheets', 'Rows'])
# same result

Selecting Sheet1 from the first workbook:

df.loc[('myfile.xlsx', 'Sheet1')]

     Col1  Col2
Rows           
0       A     1
1       B     2
2       C     3