Search code examples
pythonpandasdataframegroup-bycriteria

How to pick values based on aggregation of other columns in a group by with Python Pandas?


I have data that looks like this:

system    question    answer    grade    rank
sys1      q1          a1        A        5
sys1      q1          a1        B        10
sys2      q1          a1        C        1
sys2      q1          a1        D        11

My goal is to group by question and answer, taking the best grade and rank, as well as the system associated with the best grade. Below is an example of what I'd like to execute:

>>df.groupby(['question', 'answer']).agg({'grade': np.min, 'rank': np.min, 'system': ???).reset_index()
question    answer    grade    rank    system
q1          a1        A        1       sys1

I'm getting what I want for grade and rank as I simply take the minimums. What's the best way to get the desired system?


Solution

  • I think you can use sort_values first and then aggregate last :

    print (df)
      system question answer grade  rank
    0   sys1       q1     a1     A     5
    1   sys2       q1     a1     B    10
    2   sys3       q1     a1     C     1
    3   sys4       q1     a1     D    11
    
    df = df.sort_values('grade')
           .groupby(['question', 'answer'])
           .agg({'grade': np.min, 'rank': np.min, 'system':'last'})
           .reset_index()
    print (df)
      question answer system  rank grade
    0       q1     a1   sys4     1     A
    

    I m not sure, but maybe need first:

    df = df.sort_values('grade')
           .groupby(['question', 'answer'])
           .agg({'grade': np.min, 'rank': np.min, 'system':'first'})
           .reset_index()
    print (df)
      question answer system  rank grade
    0       q1     a1   sys1     1     A