Search code examples
pythonpandasnumpydataframemulti-index

Why does pandas multi-index dataframe slicing seem inconsistent?


Why is it that when slicing a multi-index dataframe, you can get away with simpler syntax as long as you are slicing the level-0 index? Here is an example dataframe:

           hi
a b   c      
1 foo baz   0
      can   1
  bar baz   2
      can   3
2 foo baz   4
      can   5
  bar baz   6
      can   7
3 foo baz   8
      can   9
  bar baz  10
      can  11

These work:

df.loc[1, 'foo', :]
df.loc[1, :, 'can']

While this doesn't:

df.loc[:, 'foo', 'can']

Forcing me to use one of these instead:

df.loc[(slice(None), 'foo', 'can'), :]
df.loc[pd.IndexSlice[:, 'foo', 'can'], :]

Below are the same examples but with more detail:

In [1]: import pandas as pd
import numpy as np

ix = pd.MultiIndex.from_product([[1, 2, 3], ['foo', 'bar'], ['baz', 'can']], names=['a', 'b', 'c'])
data = np.arange(len(ix))
df = pd.DataFrame(data, index=ix, columns=['hi'])
print df

           hi
a b   c      
1 foo baz   0
      can   1
  bar baz   2
      can   3
2 foo baz   4
      can   5
  bar baz   6
      can   7
3 foo baz   8
      can   9
  bar baz  10
      can  11

In [2]: df.sort_index(inplace=True)
print df.loc[1, 'foo', :]

           hi
a b   c      
1 foo baz   0
      can   1

In [3]: print df.loc[1, :, 'can']

           hi
a b   c      
1 bar can   3
  foo can   1

In [4]: print df.loc[:, 'foo', 'can']

KeyError: 'the label [foo] is not in the [columns]'

In [5]: print df.loc[(slice(None), 'foo', 'can'), :]

           hi
a b   c      
1 foo can   1
2 foo can   5
3 foo can   9

In [6]: print df.loc[pd.IndexSlice[:, 'foo', 'can'], :]

           hi
a b   c      
1 foo can   1
2 foo can   5
3 foo can   9

Solution

  • All three examples are technically ambiguous, but in the first two, Pandas guesses your intent correctly. Since slicing rows, selecting columns (i.e., df.loc[:, columns]) is a common idiom, the inference seems to pick that interpretation.

    The inference is kind of messy, so I think it's much better to be explicit. It's not that much extra typing if you alias IndexSlice:

    idx = pd.IndexSlice
    df.loc[idx[1, 'foo'], :]
    df.loc[idx[1, :, 'can'], :]
    df.loc[idx[:, 'foo', 'can'], :]