Search code examples
pythonpandasdataframeindexing

Name all rows with the Index name


I have a strange request, easy to understand but should be very complicated to realize in my opinion. I have a dataframe composed of several spreadsheet excel (using df concat).

I would like to use all the df names (the excel spreadsheet name) to name all my rows, to know in which spreadsheet is coming from this row. Because thereafter, I will mix those datas.

Thank you for your time/help :)

Use df name as rows name.


Solution

  • Suppose we have an Excel workbook like this:

    Excel workbook with 2 sheets

    We can use pd.read_excel with sheet_name=None to read this into a dictionary with the sheet names as keys:

    import pandas as pd
    
    file_name = 'myfile.xlsx'
    
    dict_df = pd.read_excel(file_name, sheet_name=None)
    
    dict_df
    
    {'Sheet1':   Col1  Col2
     0    A     1
     1    B     2
     2    C     3,
     'Sheet2':   Col1  Col2
     0    D     4
     1    E     5
     2    F     6}
    

    Now, we can use pd.concat to add the dictionary keys to the index values. As of pd 2.0.0 you can use df.add_prefix with axis=0:

    # pd >= 2.0.0
    df = pd.concat([v.add_prefix(f"{k}_", axis=0) for k, v in dict_df.items()])
    
    df
    
             Col1  Col2
    Sheet1_0    A     1
    Sheet1_1    B     2
    Sheet1_2    C     3
    Sheet2_0    D     4
    Sheet2_1    E     5
    Sheet2_2    F     6
    

    For earlier versions, you can overwrite the index using df.set_index:

    # pd < 2.0.0
    df = pd.concat([v.set_index(f"{k}_" + v.index.astype(str)) 
                    for k, v in dict_df.items()])
    
    # same result
    

    But it may be nicer to add the sheet names as a separate level, making use of the keys and names parameters:

    df = pd.concat([v for v in dict_df.values()], 
                   keys=dict_df.keys(), 
                   names=['Sheets', 'Rows'])
    
    df
    
                Col1  Col2
    Sheets Rows           
    Sheet1 0       A     1
           1       B     2
           2       C     3
    Sheet2 0       D     4
           1       E     5
           2       F     6
    

    Or simply:

    df = pd.concat(dict_df, names=['Sheets', 'Rows'])
    # same result
    

    E.g., this way you can easily select a particular sheet:

    df.loc['Sheet1']
    
         Col1  Col2
    Rows           
    0       A     1
    1       B     2
    2       C     3
    

    If you are reading from multiple Excel workbooks and also want to include the file names, you can do something like this:

    workbooks = ['myfile.xlsx', 'myfile2.xlsx'] # `myfile2` here a copy of `myfile`
    
    nested_dict = {(wb,k): v for wb in workbooks 
                   for k, v in pd.read_excel(wb, sheet_name=None).items()}
    
    df = pd.concat([v for v in nested_dict.values()], 
                   keys=nested_dict.keys(), 
                   names=['Filename', 'Sheets', 'Rows'])
    
    df
    
                             Col1  Col2
    Filename     Sheets Rows           
    myfile.xlsx  Sheet1 0       A     1
                        1       B     2
                        2       C     3
                 Sheet2 0       D     4
                        1       E     5
                        2       F     6
    myfile2.xlsx Sheet1 0       A     1
                        1       B     2
                        2       C     3
                 Sheet2 0       D     4
                        1       E     5
                        2       F     6
    

    Or again:

    df = pd.concat(nested_dict, names=['Filename', 'Sheets', 'Rows'])
    # same result
    

    Selecting Sheet1 from the first workbook:

    df.loc[('myfile.xlsx', 'Sheet1')]
    
         Col1  Col2
    Rows           
    0       A     1
    1       B     2
    2       C     3