Search code examples
pythonpandassklearn-pandas

Pandas groupby -- get output value based on max value of another column


I have the following dataframe:

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Habitat':['Jungle', 'Jungle',
                              'Sky', 'Sky'],
                   'Tmp':['A', 'B', 'C', 'D'],
                   'Max Speed': [380., 370., 24., 26.]})

>>> df
   Animal Habitat Tmp  Max Speed
0  Falcon  Jungle   A      380.0
1  Falcon  Jungle   B      370.0
2  Parrot     Sky   C       24.0
3  Parrot     Sky   D       26.0

I am trying to add additional column "Output" which will take the value from "Tmp" based on maximum value of column "Max Speed" in a groupby done of columns "Animal" and "Habitat".

Desired output:

   Animal Habitat Tmp  Max Speed Output
0  Falcon  Jungle   A      380.0      A
1  Falcon  Jungle   B      370.0      A
2  Parrot     Sky   C       24.0      D
3  Parrot     Sky   D       26.0      D

It can be done using a groupby and then joining it in the original dataset. But is there a more efficient way to do this? Maybe using transform or something else?


Solution

  • You can define a function taking pd.dataframe as argument:

    import pandas as pd
    import numpy as np
    
    def fmax(df_):
        df_['Output'] = df_.sort_values(['Max Speed']).tail(1)['Tmp'].squeeze()
        return df_
    

    Please note use of pandas.DataFrame.squeeze function to return scalar value. Then simply apply above function using groupby:

    df.groupby(['Animal','Habitat']).apply(fmax)
    

    The result is:

       Animal Habitat Tmp  Max Speed Output
    0  Falcon  Jungle   A      380.0      A
    1  Falcon  Jungle   B      370.0      A
    2  Parrot     Sky   C       24.0      D
    3  Parrot     Sky   D       26.0      D