Search code examples
pythonpython-3.xpandasdataframexlrd

Concatenate all sheet files and create one new column with sheet names in Python


Given a excel file test.xlsx which has many sheets: 1.DC, 2.SH and 3.GZ, etc.

import xlrd
dfs = xlrd.open_workbook('./test.xlsx', on_demand = True)
print(dfs.sheet_names()) 

Out:

['1.DC', '2.SH', '3.GZ']

How could I read and concatenate all the sheets and create a new column with sheet names while removing 1., 2., 3., etc. in the starting part of each sheet names?

The expected result will like this:

enter image description here


Solution

  • Use read_excel with sheet_name=None for convert all sheets to DataFrames:

    dfs = pd.read_excel('test.xlsx', sheet_name=None)
    
    dfs = {k: v.loc[:, ~v.columns.str.contains('Unnamed')] for k, v in dfs.items()}
    
    # print(dfs) 
    

    Then if need one column use lsit comparehension with select this column, here name

    df = pd.DataFrame([(x, k) for k, v in dfs.items() for x in v['name']], 
                      columns=['name','city'])
    df['city'] = df['city'].str.replace('^[0-9\.]+', '')
    print (df)
          name    city
    0    James      DC
    1     Bond      DC
    2   Steven      DC
    3   Walker      SH
    4      Tom      SH
    5   Filler      GZ
    6   Cooker      GZ
    7      Tim      GZ
    

    Or if need all columns omit [name] here:

    dfs = pd.read_excel('test.xlsx', sheet_name=None)
    print(dfs) 
    
    df = (pd.concat(dfs)
            .reset_index(level=1, drop=True)
            .rename_axis('city')['name']
            .reset_index())
    df['city'] = df['city'].str.replace('^[0-9\.]+', '')
    print (df)
          city    name
    0       DC   James
    1       DC    Bond
    2       DC  Steven
    3       SH  Walker
    4       SH     Tom
    5       GZ  Filler
    6       GZ  Cooker
    7       GZ     Tim