Search code examples
pythonpandasmulti-index

How can I sort multiple levels of a pandas MultiIndex with custom key functions?


Say I have a dataframe with a MultiIndex like this:

import pandas as pd
import numpy as np

my_index = pd.MultiIndex.from_product(
    [(3,1,2), ("small", "tall", "medium"), ("B", "A", "C")],
    names=["number", "size", "letter"]
)

df_0 = pd.DataFrame(np.random.rand(27, 2), columns=["x", "y"], index=my_index)
                             x         y
number size   letter
3      small  B       0.950073  0.599918
              A       0.014450  0.472736
              C       0.208064  0.778538
       tall   B       0.979631  0.367234
              A       0.832459  0.449875
              C       0.761929  0.053144
       medium B       0.460764  0.800131
              A       0.355746  0.573813
              C       0.078924  0.058865
1      small  B       0.405209  0.354636
              A       0.536242  0.012904
              C       0.458910  0.723627
       tall   B       0.859898  0.442954
              A       0.109729  0.885598
              C       0.378363  0.220695
       medium B       0.652191  0.685181
              A       0.503525  0.400973
              C       0.454671  0.188798
2      small  B       0.407654  0.168782
              A       0.393451  0.083023
              C       0.073432  0.165209
       tall   B       0.678226  0.108497
              A       0.718348  0.077935
              C       0.595500  0.146271
       medium B       0.719985  0.422167
              A       0.950950  0.532390
              C       0.687721  0.920229

Now I want to sort the index by the different levels, first number, then size, and finally letter.

If I do this...

df_1 = df_0.sort_index(level=["number", "size", "letter"], inplace=False)

... the size of course gets sorted in alphabetical order.

                             x         y
number size   letter
1      medium A       0.503525  0.400973
              B       0.652191  0.685181
              C       0.454671  0.188798
       small  A       0.536242  0.012904
              B       0.405209  0.354636
              C       0.458910  0.723627
       tall   A       0.109729  0.885598
              B       0.859898  0.442954
              C       0.378363  0.220695
2      medium A       0.950950  0.532390
              B       0.719985  0.422167
              C       0.687721  0.920229
       small  A       0.393451  0.083023
              B       0.407654  0.168782
              C       0.073432  0.165209
       tall   A       0.718348  0.077935
              B       0.678226  0.108497
              C       0.595500  0.146271
3      medium A       0.355746  0.573813
              B       0.460764  0.800131
              C       0.078924  0.058865
       small  A       0.014450  0.472736
              B       0.950073  0.599918
              C       0.208064  0.778538
       tall   A       0.832459  0.449875
              B       0.979631  0.367234
              C       0.761929  0.053144

But I want it to be sorted by a custom key. I know I can sort the size level with a custom sort function like this:

custom_key = np.vectorize(lambda x: {"small": 0, "medium": 1, "tall": 2}[x])

df_2 = df_0.sort_index(level=1, key=custom_key, inplace=False)
                             x         y
number size   letter
1      small  A       0.536242  0.012904
              B       0.405209  0.354636
              C       0.458910  0.723627
2      small  A       0.393451  0.083023
              B       0.407654  0.168782
              C       0.073432  0.165209
3      small  A       0.014450  0.472736
              B       0.950073  0.599918
              C       0.208064  0.778538
1      medium A       0.503525  0.400973
              B       0.652191  0.685181
              C       0.454671  0.188798
2      medium A       0.950950  0.532390
              B       0.719985  0.422167
              C       0.687721  0.920229
3      medium A       0.355746  0.573813
              B       0.460764  0.800131
              C       0.078924  0.058865
1      tall   A       0.109729  0.885598
              B       0.859898  0.442954
              C       0.378363  0.220695
2      tall   A       0.718348  0.077935
              B       0.678226  0.108497
              C       0.595500  0.146271
3      tall   A       0.832459  0.449875
              B       0.979631  0.367234
              C       0.761929  0.053144

But how can I sort by all levels like for df_1 and use the custom key on the second level? Expected output:

                             x         y
number size   letter
1      small  A       0.536242  0.012904
              B       0.405209  0.354636
              C       0.458910  0.723627
       medium A       0.503525  0.400973
              B       0.652191  0.685181
              C       0.454671  0.188798
       tall   A       0.109729  0.885598
              B       0.859898  0.442954
              C       0.378363  0.220695
2      small  A       0.393451  0.083023
              B       0.407654  0.168782
              C       0.073432  0.165209
       medium A       0.950950  0.532390
              B       0.719985  0.422167
              C       0.687721  0.920229
       tall   A       0.718348  0.077935
              B       0.678226  0.108497
              C       0.595500  0.146271
3      small  A       0.014450  0.472736
              B       0.950073  0.599918
              C       0.208064  0.778538
       medium A       0.355746  0.573813
              B       0.460764  0.800131
              C       0.078924  0.058865
       tall   A       0.832459  0.449875
              B       0.979631  0.367234
              C       0.761929  0.053144

And how should I define the custom key function, so that I also can access the level in sort_index by name like this?

df_3 = df_0.sort_index(level="size", key=custom_key, inplace=False)

Here, it gives a KeyError: 'Level size not found'


Solution

  • The ideal would be to use ordered Categorical data.

    Else, you can use a custom mapper based on the level name:

    # define here custom sorters
    # all other levels will be sorted by default order
    order = {'size': ['small', 'medium', 'tall']}
    
    def sorter(s):
        if s.name in order:
            return s.map({k:v for v,k in enumerate(order[s.name])})
        return s
    
    out = df_0.sort_index(level=["number", "size", "letter"], key=sorter)
    

    Output:

                                 x         y
    number size   letter                    
    1      small  A       0.530753  0.687982
                  B       0.722848  0.974920
                  C       0.174058  0.695016
           medium A       0.397016  0.550404
                  B       0.426989  0.843007
                  C       0.929218  0.497728
           tall   A       0.159078  0.005675
                  B       0.917871  0.384265
                  C       0.685435  0.585242
    2      small  A       0.423254  0.838356
                  B       0.342158  0.209632
    ...