Search code examples
pandasdataframemulti-index

Managing MultiIndex Dataframe objects


So, yet another problem using grouped DataFrames that I am getting so confused over...

I have defined an aggregation dictionary as:

aggregations_level_1 = {
       'A': {
           'mean': 'mean',
        }, 

       'B': {
           'mean': 'mean',
        },  
    }

And now I have two grouped DataFrames that I have aggregated using the above, then joined:

grouped_top = 
     df1.groupby(['group_lvl']).agg(aggregations_level_1)
grouped_bottom = 
     df2.groupby(['group_lvl']).agg(aggregations_level_1)

Joining these:

df3 = grouped_top.join(grouped_bottom, how='left', lsuffix='_top_10', 
rsuffix='_low_10')

                      A_top_10        A_low_10     B_top_10        B_low_10
                      mean            mean         mean            mean
group_lvl                                      
    a                 3.711413    14.515901        3.711413        14.515901
    b                 4.024877    14.442106        3.694689        14.209040
    c                 3.694689    14.209040        4.024877        14.442106

Now, if I call index and columns I have:

print df3.index
    >> Index([u'a', u'b', u'c'], dtype='object', name=u'group_lvl')

print df3.columns
    >> MultiIndex(levels=[[u'A_top_10', u'A_low_10', u'B_top_10', u'B_low_10'], [u'mean']],
       labels=[[0, 1, 2, 3], [0, 0, 0, 0]])

So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?

  • How do I slice and call this? Say I would like to have only A_top_10, A_low_10 for all a,b,c?
  • Only A_top_10, B_top_10 for a and c?

I am pretty confused so any overall help would be great!


Solution

  • Need slicers, but first sort columns by sort_index else error:

    UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'

    df = df.sort_index(axis=1)
    
    idx = pd.IndexSlice
    df1 = df.loc[:, idx[['A_low_10', 'A_top_10'], :]]
    print (df1)
                A_low_10  A_top_10
                    mean      mean
    group_lvl                     
    a          14.515901  3.711413
    b          14.442106  4.024877
    c          14.209040  3.694689
    

    And:

    idx = pd.IndexSlice
    df2 = df.loc[['a','c'], idx[['A_top_10', 'B_top_10'], :]]
    print (df2)
               A_top_10  B_top_10
                   mean      mean
    group_lvl                    
    a          3.711413  3.711413
    c          3.694689  4.024877
    

    EDIT:

    So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?

    I think very close, better is say I have MultiIndex in columns.