Search code examples
pythonexcelpandasmulti-index

Select columns when reading excel with multiindex header in Pandas


I have an excel sheet that looks like this and I would like to read it as a multi-indexed Pandas Dataframe:

table

desired result:

       M1        M2     
site conc   LQ conc   LQ
C1 A    1  0.5   10  0.5
   B    2  0.6    5  0.6
   C  <LQ  1.0   14  1.0
C2 A  <LQ  1.1   12  1.1
   C    3  0.5   14  0.5
   D    1  0.5   11  0.5

The following code used to produce the desired result:

df = pd.read_excel("test.xlsx", 
                   sheet_name="test",
                   header=[1,2],
                   index_col=[0,1],
                   usecols='A:B,D:G',
                   parse_dates = False,
                   )

But now (pandas version 0.25.1) it produces the following error:

ValueError: cannot specify usecols when specifying a multi-index header

Any ideas for a simple work-around?

To reproduce the excel table:

;;start;M1;;M2;;end;M1;;M2;
cam;site;bla;conc;LQ;conc;LQ;bla;conc;LQ;conc;LQ
C1;A;bla;1;0.5;10;0.5;bla;1;0.5;10;0.5
C1;B;bla;2;0.6;5;0.6;bla;2;0.6;5;0.6
C1;C;bla;<LQ;1;14;1;bla;<LQ;1;14;1
C2;A;bla;<LQ;1.1;12;1.1;bla;<LQ;1.1;12;1.1
C2;C;bla;3;0.5;14;0.5;bla;3;0.5;14;0.5
C2;D;bla;1;0.5;11;0.5;bla;1;0.5;11;0.5

Solution

  • Hello I had the same problem, you can't use usecols when headers is a multiple index

    I solved the problem by doing this:

    df = pd.read_excel("test.xlsx",
                       sheet_name="test",
                       header=2,
                       index_col=[0,1],
                       usecols='A:B,D:G',
                       parse_dates=False)
    index = pd.read_excel("test.xlsx",
                          sheet_name="test",
                          header=None,
                          skiprows=1,
                          index_col=[0,1],
                          nrows=2,
                          usecols='A:B,D:G',
                          parse_dates=False)
    index = index.fillna(method='ffill', axis=1)
    df.columns = pd.MultiIndex.from_arrays(index.values)
    df