Search code examples
pythonpandassorting

Python pandas multi-column sorting problem


I want to sort the first column according to the internal algorithm, and then sort the second column according to the custom sorting method

The test data is as follows:

     A              B
Ankang Shaanxi      Ankang Southeast
Baoding Anguo       Baoding Anguo Northeast
Baoding Anguo       Baoding Anguo Southeast
Changsha Hunan      Changsha Hunan Bright
Ankang Shaanxi      Ankang Northeast
Baoding Anguo       Baoding Anguo Southwest
Baoding Anguo       Baoding Anguo Upper
Ankang Shaanxi      Ankang Southwest    
Luoyang Henan       Luoyang Henan Upper
Baoding Anguo       Baoding Anguo Northwest
Changsha Hunan      Changsha Hunan Lower
Ankang Shaanxi      Ankang Southwest Upper  
Ankang Shaanxi      Ankang Northwest

I hope to be able to arrange it as shown below

The first column is sorted together using pandas' built-in string sorting algorithm, and then the second column is sorted using the custom order algorithm of northeast, southeast, northwest, southwest,upper.

I used pandas' sort_values() method to sort. I had no problem sorting a single column, but it always failed when I tried to sort two columns together.

import pandas as pd

data={'A':['Ankang Shaanxi','Baoding Anguo','Baoding Anguo','Changsha Hunan','Ankang Shaanxi',
'Baoding Anguo','Baoding Anguo','Ankang Shaanxi','Luoyang Henan','Baoding Anguo',
'Changsha Hunan','Ankang Shaanxi','Ankang Shaanxi'],
'B':['Ankang Southeast','Baoding Anguo Northeast','Baoding Anguo Southeast','Changsha Hunan Bright','Ankang Northeast','Baoding Anguo Southwest','Baoding Anguo Upper','Ankang Southwest','Luoyang Henan Upper','Baoding Anguo Northwest','Changsha Hunan Lower','Ankang Southwest Upper','Ankang Northwest']}

df=pd.DataFrame(data)
def sort_fun(x):

      return x.split()[-1]

df['sort_value']=df['B'].apply(sort_fun)

sort_dicts={'Northeast':0,'Southeast':1,'Northwest':2,'Southwest':3,'Upper':4}

df.sort_values(by=['A','sort_value'],key=lambda x :x.map(sort_dicts))

I referred to it Pandas: How to custom-sort on multiple columns?

      A                    B
Ankang Shaanxi      Ankang Northeast
Ankang Shaanxi      Ankang Southeast
Ankang Shaanxi      Ankang Northwest
Ankang Shaanxi      Ankang Southwest    
Ankang Shaanxi      Ankang Southwest Upper
Baoding Anguo       Baoding Anguo Northeast
Baoding Anguo       Baoding Anguo Southeast
Baoding Anguo       Baoding Anguo Northwest
Baoding Anguo       Baoding Anguo Southwest
Baoding Anguo       Baoding Anguo Upper
Changsha Hunan      Changsha Hunan Bright
Changsha Hunan      Changsha Hunan Lower
Luoyang Henan       Luoyang Henan Upper

Solution

  • The basic logic you can use for column 'B':

    df['B'].str.split().str[-1].map(sort_dicts)
    
    0     1.0
    1     0.0
    2     1.0
    3     NaN
    4     0.0
    5     3.0
    6     4.0
    7     3.0
    8     4.0
    9     2.0
    10    NaN
    11    4.0
    12    2.0
    Name: B, dtype: float64
    

    Couple of ways to sort using this logic:

    Option 1

    Chain calls to df.sort_values:

    # note 'B' first
    
    def sort_fun(s):
        return s.str.split().str[-1].map(sort_dicts)
    
    out = (df.sort_values('B', key=sort_fun)
           .sort_values('A', ignore_index=True)
           )
    

    Option 2

    Adjust sort_fun to only affect col 'B':

    def sort_fun2(s, name):
        if s.name == name: # for 'B'
            return s.str.split().str[-1].map(sort_dicts)
        return s
    
    out2 = df.sort_values(['A', 'B'], key=lambda x: sort_fun2(x, 'B'),
                          ignore_index=True)
    

    Indeed, your original approach also applied the function passed to key to df['A'] (i.e., df['A'].map(sort_dicts)), leading to a series with NaN values to "sort".

    Option 3

    Use np.lexsort as suggested by @mozway in the linked post:

    # again: note 'B' goes first
    
    import numpy as np
    
    sort = np.lexsort((df['B'].str.split().str[-1].map(sort_dicts), 
                       df['A']))
    
    out3 = df.iloc[sort].reset_index(drop=True)
    

    Output

    out
    
                     A                        B
    0   Ankang Shaanxi         Ankang Northeast
    1   Ankang Shaanxi         Ankang Southeast
    2   Ankang Shaanxi         Ankang Northwest
    3   Ankang Shaanxi         Ankang Southwest
    4   Ankang Shaanxi   Ankang Southwest Upper
    5    Baoding Anguo  Baoding Anguo Northeast
    6    Baoding Anguo  Baoding Anguo Southeast
    7    Baoding Anguo  Baoding Anguo Northwest
    8    Baoding Anguo  Baoding Anguo Southwest
    9    Baoding Anguo      Baoding Anguo Upper
    10  Changsha Hunan    Changsha Hunan Bright
    11  Changsha Hunan     Changsha Hunan Lower
    12   Luoyang Henan      Luoyang Henan Upper
    

    Equality check with desired output:

    data2 = {'A': ['Ankang Shaanxi', 'Ankang Shaanxi', 'Ankang Shaanxi', 
                   'Ankang Shaanxi', 'Ankang Shaanxi', 'Baoding Anguo', 
                   'Baoding Anguo', 'Baoding Anguo', 'Baoding Anguo', 
                   'Baoding Anguo', 'Changsha Hunan', 'Changsha Hunan', 
                   'Luoyang Henan'], 
             'B': ['Ankang Northeast', 'Ankang Southeast', 'Ankang Northwest', 
                   'Ankang Southwest', 'Ankang Southwest Upper', 
                   'Baoding Anguo Northeast', 'Baoding Anguo Southeast', 
                   'Baoding Anguo Northwest', 'Baoding Anguo Southwest', 
                   'Baoding Anguo Upper', 'Changsha Hunan Bright', 
                   'Changsha Hunan Lower', 'Luoyang Henan Upper']}
    desired = pd.DataFrame(data2)
    
    all(df.equals(desired) for df in [out, out2, out3])
    # True