Search code examples
pythonpandasindexingmulti-index

Taking subset of multi-index pandas df, unexpected behavior with index


I have data from many repetitions of a particular data stream organized as multi-indexed dataframes (where each repetition is marked as e.g. ['rep1', 'rep2', .., 'repN']). I often need to take a subset of the larger dataframe in a range of these repetitions (e.g. df.loc['rep5':'rep50', :]).

I have not been able to figure out a way to do this, though, without the index of the subsequent subset still retaining the entire list of index values from the larger datarame (i.e. ['rep1', 'rep2', .., 'repN']).

So, for a simplified example, given the following df:

dfs = [pd.DataFrame({'vals': range(3)}) for i in range(3)]
df = pd.concat(dfs, keys=['l1', 'l2', 'l3'])

df

      vals
l1 0     0
   1     1
   2     2
l2 0     0
   1     1
   2     2
l3 0     0
   1     1
   2     2

And then taking a subset of it:

subset = df.loc['l2':, :]
subset
      vals
l2 0     0
   1     1
   2     2
l3 0     0
   1     1
   2     2

Looking at the index of the subset, the original 'l1' index remains:

subset.index
MultiIndex(levels=[['l1', 'l2', 'l3'], [0, 1, 2]],
           labels=[[1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2]

If I reset that index level, however, 'l1' seems to disappear:

subset.reset_index(level=0)
    level_0 vals
0   l2  0
1   l2  1
2   l2  2
0   l3  0
1   l3  1
2   l3  2

And then I can put 'level_0' back in as an index, essentially arriving at what I'm trying to achieve

subset.reset_index(level=0).set_index('level_0', append=True).reorder_levels([1, 0]).index
MultiIndex(levels=[['l2', 'l3'], [0, 1, 2]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=['level_0', None])

However this is obviously a very circuitous route. The other option I suppose is to drop the other rows, but I find df.drop to be very unwieldy when trying to do a range of rows for a multi-index df.

This behavior does not occur if the dataframe is not hierarchical. E.g.:

df = pd.DataFrame({'vals': range(5)}, index=['a', 'b', 'c', 'd', 'e'])

df.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

and then taking a subset

subset = df.loc[('b', 'c', 'd'),:]
subset.index
Index(['b', 'c', 'd'], dtype='object')

It's not clear to me why this is behaving the way that it is.


Solution

  • What, I think, you need is pd.MultiIndex.remove_unused_levels

    subset.index.remove_unused_levels()
    

    Output:

    MultiIndex(levels=[['l2', 'l3'], [0, 1, 2]],
               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])