Search code examples
pythonpandasdataframemulti-index

How to correctly use pandas sort_index with level and axis arguments?


Regarding this df:

               Amount                          type                 
Month_year 2019-06-01     2019-07-01     2019-06-01    2019-07-01   
TYPE_ID             1   2          1   2          1  2          1  2
ID                                                                  
100                20  10         40  20          1  1          2  1
200                80  60         30  10          2  2          1  1

The following code:

df = df.sort_index(axis=1, level=[1,2])

produces this:

               Amount       type     Amount  ...       type     Amount       type
Month_year 2019-06-01 2019-06-01 2019-06-01  ... 2019-07-01 2019-07-01 2019-07-01
TYPE_ID             1          1          2  ...          1          2          2
ID                                           ...                                 
100                20          1         10  ...          2         20          1
200                80          2         60  ...          1         10          1

I really don't understand what's going on. I've read the docs but there are no examples and the description is really obscure.

Could anyone explain to me how this method works and how I received this result?


Solution

  • Essentially,

    sort_index with axis=1 sorts the column headers, and this ordering is then used to set the order of the columns.

    And, the corollary,

    sort_index with axis=0 sorts the index, and this ordering is then used to set the order of the rows.


    This is what your input df looks like:

    enter image description here

    The top three "rows" in the picture above correspond to a pandas MultiIndex column of df, which looks like this:

    df.columns
    MultiIndex([('Amount', '2019-06-01', 1),
                ('Amount', '2019-06-01', 2),
                ('Amount', '2019-07-01', 1),
                ('Amount', '2019-07-01', 2),
                (  'type', '2019-06-01', 1),
                (  'type', '2019-06-01', 2),
                (  'type', '2019-07-01', 1),
                (  'type', '2019-07-01', 2)])
    

    Let's pretend your 3-level multiIndex column is magically converted into a DataFrame which each level its own column called cdf:

    cdf
        level_0     level_1  level_2
    (1)  Amount  2019-06-01        1
    (2)  Amount  2019-06-01        2
    (3)  Amount  2019-07-01        1
    (4)  Amount  2019-07-01        2
    (5)    type  2019-06-01        1
    (6)    type  2019-06-01        2
    (7)    type  2019-07-01        1
    (8)    type  2019-07-01        2
    

    The row numbers here correspond to the column identifiers in the original DataFrame. Let's see what happens when we sort cdf by the last two columns:

    cdf.sort_values(['level_1', 'level_2'])
    
        level_0     level_1  level_2
    (1)  Amount  2019-06-01        1
    (5)    type  2019-06-01        1
    (2)  Amount  2019-06-01        2
    (6)    type  2019-06-01        2
    (3)  Amount  2019-07-01        1
    (7)    type  2019-07-01        1
    (4)  Amount  2019-07-01        2
    (8)    type  2019-07-01        2
    

    Notice the index of the sorted cdf:

    (1) (5) (2) (6) (3) (7) (4) (8)
    

    Now let's see what happens when we apply the sort_index operation to df:

    df.sort_index(level=[1, 2], axis=1)
    

    enter image description here

    The ellipses in the center indicate that not all columns can be displayed due to the width of the terminal (in fact, columns (6) and (3) are not displayed but they are there very much there), but that's not the interesting part. Contrast the column ordering here with the row ordering of sorted cdf, you'll see these are are identical.