Search code examples
pythonpandasmulti-index

pandas Multiindex columns method selection returns all columns instead of subset


Hi here is an example of behaviour I don't understand. Here is an example of multiindex in columns

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

Now i'd like to select a subset of df by the first level and return the relevant columns:

df.loc[:, ['bar']].columns

returns

MultiIndex(levels=[['bar'], ['one', 'two']],
           labels=[[0, 0], [0, 1]],
           names=['first', 'second'])

but

df.loc[:, ['bar', 'baz']].columns

returns

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['first', 'second'])

Any reason why the second one would return ALL the column names instead of

MultiIndex(levels=[['bar', 'baz'], ['one', 'two']] etc...

And more importantly any quick fix so I can return only the relevant data?

This is made even more relevant as pandas are deprecating Panels (which used to be a very elegant way to store multidimensional data)


Solution

  • In new pandas version (0.20.1) use MultiIndex.remove_unused_levels:

    print (df.loc[:, ['bar', 'baz']].columns)
    MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
               names=['first', 'second'])
    
    print (df.loc[:, ['bar', 'baz']].columns.remove_unused_levels())
    MultiIndex(levels=[['bar', 'baz'], ['one', 'two']],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
               names=['first', 'second'])
    

    Advanced shown levels in documentation.