Search code examples
pythonpandas

How to sort a pandas dataframe using group by


I am working on a dataframe similar to below sample:

import pandas as pd
import numpy as np

np.random.seed(0)
np.random.seed(0)
df = pd.DataFrame({'date' : np.tile(['2024-05-01', '2024-06-01'], 4),
                  'State' : np.repeat(['fl', 'ny', 'mi', 'nc'], 2),
                  'Rev' : [21000, 18200, 51200, 48732, 5676, 6798, 24012, 25005],
                  'Score' : np.random.normal(size = 8),
                  'Value' : np.random.randint(10, 50, size = 8)})
df

    date        State   Rev     Score       Value
0   2024-05-01  fl      21000   1.764052    34
1   2024-06-01  fl      18200   0.400157    22
2   2024-05-01  ny      51200   0.978738    11
3   2024-06-01  ny      48732   2.240893    48
4   2024-05-01  mi       5676   1.867558    49
5   2024-06-01  mi       6798   -0.977278   33
6   2024-05-01  nc      24012   0.950088    34
7   2024-06-01  nc      25005   -0.151357   27

Expected output should be the dataframe sorted by Rev, largest to the smallest, and within each State, the date column should be sorted from in ascending order.

Tried below code:

(df.sort_values(by = ['Rev'], ascending = [False]).
     groupby('State', as_index = False).
     apply(lambda x : x.sort_values('date')).reset_index(drop = True))

But it's not giving me the required output.

    date        State   Rev     Score               Value
0   2024-05-01  fl      21000   1.764052345967664   34
1   2024-06-01  fl      18200   0.4001572083672233  22
2   2024-05-01  mi       5676   1.8675579901499675  49
3   2024-06-01  mi       6798   -0.977277879876411  33
4   2024-05-01  nc      24012   0.9500884175255894  34
5   2024-06-01  nc      25005   -0.1513572082976979 27
6   2024-05-01  ny      51200   0.9787379841057392  11
7   2024-06-01  ny      48732   2.240893199201458   48

The output should be NY, NC, FL and MI in that order based on the Rev and date columns. i.e. for a State group, the Rev value for 2024-05-01 will decide which state will take precedence in the final output order.

Can someone help me with the code.

Expected Output:

df.iloc[[2,3, 6,7, 0,1, 4,5], : ]


    date        State   Rev     Score       Value
2   2024-05-01  ny      51200   0.978738    11
3   2024-06-01  ny      48732   2.240893    48
6   2024-05-01  nc      24012   0.950088    34
7   2024-06-01  nc      25005   -0.151357   27
0   2024-05-01  fl      21000   1.764052    34
1   2024-06-01  fl      18200   0.400157    22
4   2024-05-01  mi       5676   1.867558    49
5   2024-06-01  mi       6798   -0.977278   33

Solution

  • IMO, the easiest and most explicit approach to perform "complex"/multi-condition sorts is to use numpy.lexsort and pass the constraints in reverse order of preference:

    out = df.iloc[np.lexsort([df['date'],
                              -df.groupby('State')['Rev'].transform('max')])]
    

    Which reads (in reverse order with lexsort):

    • sort in priority by decreasing max Rev per State
    • in case of a tie, sort by increasing date

    In case two States could have the same max Rev, if you want to ensure having separate groups, add df['State'] as an intermediate condition:

    • sort in priority by decreasing max Rev per State
    • in case of a tie in max Rev sort by State name (you could use another condition, like total Rev per State, etc.)
    • in case of a tie, sort by increasing date
    out = df.iloc[np.lexsort([df['date'],
                              df['State'],
                              -df.groupby('State')['Rev'].transform('max')])]
    

    Output:

             date State    Rev     Score  Value
    2  2024-05-01    ny  51200  0.978738     11
    3  2024-06-01    ny  48732  2.240893     48
    6  2024-05-01    nc  24012  0.950088     34
    7  2024-06-01    nc  25005 -0.151357     27
    0  2024-05-01    fl  21000  1.764052     34
    1  2024-06-01    fl  18200  0.400157     22
    4  2024-05-01    mi   5676  1.867558     49
    5  2024-06-01    mi   6798 -0.977278     33