I have an excel file with over 100 sheets. I need to import into pandas and create a dataframe. The problem is some columns headings contain spaces so I get a dataframe with some duplicate columns.
Is it possible to strip a space in columns in OrderedDict? I know I can use str.strip()
in the dataframe but can't find anything for OrderedDict.
I have the following structure in excel:
sheet TEST:
'ID1' 'ID2' 'ID3' 'ID4'
1 A 2016 val val
2 B 2017 val val
3 C 2018 val val
sheet TEST2:
'ID1' 'ID2' 'ID3 ' 'ID4'
1 A 2016 val val
2 B 2017 val val
3 C 2018 val val
sheet TEST3:
'ID1' 'ID2' 'ID3' 'ID4 '
1 A 2016 val val
2 B 2017 val val
3 C 2018 val val
At the moment I do the following:
df = pd.read_excel (File location, sheet_name = ['TEST', 'TEST2', 'TEST3'])
df = pd.concat(df, axis=0, sort=False)
df = df.reset_index()
This is what I get:
'ID1' 'ID2' 'ID3' 'ID4' 'ID3 ' 'ID4 '
1 A 2016 val val NaN NaN
2 B 2017 val val NaN NaN
3 C 2018 val val NaN NaN
4 A 2016 NaN val val NaN
5 B 2017 NaN val val NaN
6 C 2018 NaN val val NaN
7 A 2016 val NaN NaN val
8 B 2017 val NaN NaN val
9 C 2018 val NaN NaN val
This is what I need:
'ID1' 'ID2' 'ID3' 'ID4'
1 A 2016 val val
2 B 2017 val val
3 C 2018 val val
4 A 2016 val val
5 B 2017 val val
6 C 2018 val val
7 A 2016 val val
8 B 2017 val val
9 C 2018 val val
Many thanks
If you are reading in multiple sheets, and your workbook is large, it's more efficient to use pd.ExcelFile
to create an ExcelFile
object, then access the sheets individually. It also makes the column name formatting very straight-forward:
import pandas as pd
xls = pd.ExcelFile(filepath)
dfs = []
for sheet in xls.sheet_names:
df = pd.read_excel(xls, sheet)
df.columns = df.columns.str.strip()
dfs.append(df)
df = pd.concat(dfs)