Search code examples
pythonpandasstripordereddict

pandas OrderedDict strip space read_excel


I have an excel file with over 100 sheets. I need to import into pandas and create a dataframe. The problem is some columns headings contain spaces so I get a dataframe with some duplicate columns.

Is it possible to strip a space in columns in OrderedDict? I know I can use str.strip() in the dataframe but can't find anything for OrderedDict.

I have the following structure in excel:

sheet TEST:
   'ID1' 'ID2'  'ID3'  'ID4'
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

sheet TEST2:
   'ID1' 'ID2'  'ID3 ' 'ID4'
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

sheet TEST3:
   'ID1' 'ID2'  'ID3' 'ID4 '
1   A     2016   val    val 
2   B     2017   val    val
3   C     2018   val    val

At the moment I do the following:

df = pd.read_excel (File location, sheet_name = ['TEST', 'TEST2', 'TEST3'])

df = pd.concat(df, axis=0, sort=False)
df = df.reset_index()

This is what I get:

   'ID1'  'ID2'  'ID3'  'ID4' 'ID3 ' 'ID4 '
1   A     2016    val    val   NaN    NaN
2   B     2017    val    val   NaN    NaN
3   C     2018    val    val   NaN    NaN
4   A     2016    NaN    val   val    NaN
5   B     2017    NaN    val   val    NaN
6   C     2018    NaN    val   val    NaN
7   A     2016    val    NaN   NaN    val
8   B     2017    val    NaN   NaN    val
9   C     2018    val    NaN   NaN    val

This is what I need:

   'ID1'  'ID2'  'ID3'  'ID4' 
1   A     2016    val    val   
2   B     2017    val    val   
3   C     2018    val    val   
4   A     2016    val    val   
5   B     2017    val    val   
6   C     2018    val    val   
7   A     2016    val    val   
8   B     2017    val    val   
9   C     2018    val    val   

Many thanks


Solution

  • If you are reading in multiple sheets, and your workbook is large, it's more efficient to use pd.ExcelFile to create an ExcelFile object, then access the sheets individually. It also makes the column name formatting very straight-forward:

    import pandas as pd
    
    xls = pd.ExcelFile(filepath)
    
    dfs = []
    for sheet in xls.sheet_names:
        df = pd.read_excel(xls, sheet)
        df.columns = df.columns.str.strip()
        dfs.append(df)
    
    df = pd.concat(dfs)