Search code examples
pythonpandasdataframeslicemulti-index

Slicing based on a range of column in a multiindex column dataframe


I am creating my dataframe by doing the following:

months        = [ 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' ]
monthyAmounts = [ "actual", "budgeted", "difference" ]

income = []
names  = []

for x in range( incomeIndex + 1, expensesIndex ):
    amounts = [ randint( -1000, 15000 ) for x in range( 0, len( months ) * len( monthyAmounts ) ) ]
    income.append( amounts )
    names.append( f"name_{x}" )

index    = pd.Index( names, name = 'category' )
columns  = pd.MultiIndex.from_product( [ months, monthyAmounts ], names = [ 'month', 'type' ] )
incomeDF = pd.DataFrame( income, index = index, columns = columns )

The dataframe looks like: (removed months March - December)

          Jan                            Feb                        ...             
          actual   budgeted   difference actual budgeted difference
name_13   14593     -260      10165      9767     629    10054
name_14    6178     1398      13620      1821   10986     -663
name_15    2432     3279       7545      8196    1052     7386
name_16    9964    13098      10342      5564    4631     7422

What I want is for every row, to slice the difference column for the months Jan - May. What I can do it slice the difference column for all of the months by doing:

incomeDifferenceDF = incomeDF.loc[ :, idx[ :, 'difference' ] ]

which gives me a dataframe that looks like: (months March - December removed)

         Jan        Feb          ....
         difference difference                         
name_13       10165      10054  
name_14       13620       -663  
name_15        7545       7386  
name_16       10342       7422  

What I have tried is:

incomeDifferenceDF = incomeDF.loc[ :, idx[ 'Jan' : 'May', 'difference' ] ]

but that gives me the error:

UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [0], lexsort depth 0'

So, this seems close, but I am uncertain how to resolve the problem.

I have also tried:

incomeDifferenceDF = incomeDF.loc[ :, idx[ ['Jan':'May'], 'difference' ] ]

But that just generates the error:

SyntaxError: invalid syntax
( Points at ['Jan':'May'] )

What is the best way to do this?


Solution

  • If need select by MultiIndex, need boolean masks:

    index    = pd.Index( [1,2,3,4], name = 'category' )
    budgetMonths = pd.date_range( "January, 2018", periods = 12, freq = 'BM' ) 
    months        = [ 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                      'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' ]
    monthyAmounts = [ "actual", "budgeted", "difference" ]
    columns = pd.MultiIndex.from_product( [ months, monthyAmounts ], names = [ 'month', 'type' ])
    incomeDF = pd.DataFrame( 10, index = index, columns = columns )
    
    #trick for get values between 
    idx = pd.Series(0,index=months).loc['Jan' : 'May'].index
    print (idx)
    Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], dtype='object')
    
    mask1 = incomeDF.columns.get_level_values(0).isin(idx)
    mask2 = incomeDF.columns.get_level_values(1) == 'difference'
    
    incomeDifferenceDF = incomeDF.loc[:, mask1 & mask2]
    print (incomeDifferenceDF)
    month           Jan        Feb        Mar        Apr        May
    type     difference difference difference difference difference
    category                                                       
    1                10         10         10         10         10
    2                10         10         10         10         10
    3                10         10         10         10         10
    4                10         10         10         10         10