Search code examples
pythonpandasdataframesortinggroup-by

How to sort and group together the elements in a pandas dataframe in the way I want?


For instance, I create a pandas dataframe:

data = [['nick', 3, '2023/11/22 10:05:00', 'A'], ['tom', 3, '2023/11/22 9:25:00','A'], ['juli', 2, '2023/11/22 12:05:00', 'B'], ['ewa', 4, '2023/11/22 9:55:00', 'B'],['peter', 5, '2023/11/22 11:00:00', 'A'], ['johan', 5, '2023/11/22 9:00:00', 'C']]
col_name = ['name', 'score', 'time', 'category'] 
df = df = pd.DataFrame(data, columns=col_name)

enter image description here

There are six students from different groups, and they have different scores in an exam. I want to group together the students from the same group and then sort them in the group from the highest score to lowest. Then I want to put the group with the highest score in the beginning of the dataframe followed by the group with lower score. If two students have the same score, the one with earlier time should be put first.

I tried the code:

df_sorted=df.sort_values(['group','score'],ascending=False).groupby('group').apply(lambda x: x) 

It groups together the students and sorts them in the same group. The new dataframe I have now:

enter image description here

It only partly achieved what I need to do. Peter has a higher score than Eva and Juli, so Peter, Nick and Tom should be put right under Johan, followed by Ewa and Juli. Tom should be put in front of Nick since the time is earlier.

What should I do next to achieve the goal?


Solution

  • You can add an extra column with groupby.transform to add a group score (here max), then use this to sort the groups:

    out = (df
       .assign(max=df.groupby('category')['score'].transform('max'),
               dt=pd.to_datetime(df['time'])
              )
       .sort_values(by=['max', 'category', 'score', 'dt'],
                    ascending=[False, True, False, True]
                   )
       .drop(columns=['max', 'dt'])
    )
    

    Same logic without the temporary columns with numpy.lexsort:

    import numpy as np
    
    order = np.lexsort([pd.to_datetime(df['time']),
                        -df['score'],
                        df['category'],
                        -df.groupby('category')['score'].transform('max')])
    
    out = df.iloc[order]
    

    Output:

        name  score                 time category
    4  peter      5  2023/11/22 11:00:00        A
    1    tom      3   2023/11/22 9:25:00        A
    0   nick      3  2023/11/22 10:05:00        A
    5  johan      5   2023/11/22 9:00:00        C
    3    ewa      4   2023/11/22 9:55:00        B
    2   juli      2  2023/11/22 12:05:00        B
    

    breaking a tie between two groups with the same max

    Here groups A and C have the same maximum, if you want to have C on top because it has a greater average score than A, you can add an extra sorter:

    g = df.groupby('category')['score'].transform
    
    out = (df
       .assign(max=g('max'),
               mean=g('mean'),
               dt=pd.to_datetime(df['time'])
              )
       .sort_values(by=['max', 'mean', 'category', 'score', 'dt'],
                    ascending=[False, False, True, False, True]
                   )
       .drop(columns=['max', 'mean', 'dt'])
    )
    

    Or:

    import numpy as np
    
    g = df.groupby('category')['score'].transform
    order = np.lexsort([pd.to_datetime(df['time']),
                        -df['score'],
                        df['category'],
                        -g('mean'),
                        -g('max')])
    
    out = df.iloc[order]
    

    Output:

        name  score                 time category
    5  johan      5   2023/11/22 9:00:00        C
    4  peter      5  2023/11/22 11:00:00        A
    1    tom      3   2023/11/22 9:25:00        A
    0   nick      3  2023/11/22 10:05:00        A
    3    ewa      4   2023/11/22 9:55:00        B
    2   juli      2  2023/11/22 12:05:00        B