Search code examples
pythonpandasmulti-index

Accessing MultiIndex in a specific format


I have the following dataframe,

df = pd.DataFrame(np.random.randint(0,1000, (5,6)), \
                  columns = pd.MultiIndex.from_product([['CPC', 'Conversions'], ['April', 'June', 'May']])).rename_axis(index = {None : 'idx'})

df

      CPC                 Conversions          
      April June  May     April June  May
idx                                      
 0     663  964  971      663   76  927
 1     405  217  754      370  306   34
 2     474  229  664      354   66  885
 3      73  538  139      417  876  855
 4     619  618  618      455  134  805

I wanted to access the months column level in right format so I did this

df.loc[:, (slice(None), ['April', 'May', 'June'])]

        CPC Conversions   CPC Conversions   CPC Conversions
      April       April   May         May  June        June
idx                                                     
 0      806         202   963         975   110          55
 1      263         884   442         563   216         694
 2      462         361   780         412   858         670
 3      742         756   525          33   477         826
 4      579         332    91         802   829         231

It's weird how the result comes while it works perfectly fine when I use two months instead of three (see below)

df.loc[:, (slice(None), ['April', 'May'])]

              CPC      Conversions     
       April  May       April  May
idx                            
 0       856  619         180  593
 1        64  403         929   80
 2       973  285         803  967
 3       769  405         701  267
 4       940  368         863  717

Can anyone tell me what's wrong, it seems buggy to me.

Currently what I'm doing is creating a MultiIndex.from_product (just as I used for creating this example) with ordered months and replace the existing column with it. However, I don't want to do that since the other column levels might change over time.


Solution

  • It's not buggy, it's just unexpected:

    >>> df.columns.levels
    FrozenList([['CPC', 'Conversions'], ['April', 'June', 'May']])
    
    >>> df.columns.codes
    FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
    
    >>> df.columns.is_monotonic
    True
    

    We can see 'April' has the code 0, 'June' the code 1 and 'May' the code 2.

    Now you slice your dataframe by second level:

    # Your code (unordered codes)
    >>> df.loc[:, (slice(None), ['April', 'May', 'June'])]
    
      CPC Conversions  CPC Conversions  CPC Conversions
    April       April  May         May June        June
    ...
    

    but if your slice your dataframe according the ordered codes, it works as expected:

    >>> df.loc[:, (slice(None), ['April', 'June', 'May'])]
      CPC           Conversions          
    April June  May       April June  May
    ...
    

    Check some properties:

    >>> df.loc[:, (slice(None), ['April', 'May', 'June'])].columns.is_monotonic
    False
    
    >>> df.loc[:, (slice(None), ['April', 'June', 'May'])].columns.is_monotonic
    True
    
    >>> df.loc[:, (slice(None), ['April', 'May', 'June'])].columns.codes
    FrozenList([[0, 1, 0, 1, 0, 1], [0, 0, 2, 2, 1, 1]])
    
    >>> df.loc[:, (slice(None), ['April', 'June', 'May'])].columns.codes
    FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
    

    If you use sort_index after your slice, the dataframe is sorted as its creation:

    >>> df.loc[:, (slice(None), ['April', 'May', 'June'])].sort_index(axis=1)
      CPC           Conversions          
    April June  May       April June  May
    ...
    

    Sorting MultiIndex is based on codes.