Search code examples
pythonpandasdataframepandas-groupbybinning

Pandas df: group, bin and average in different columns?


I have data that looks qualitatively like this dummy table:

speed_observation, car_brand, traction_force
10, ford, 2
20, ford, 4
35, seat, 8
50, ford, 16
10, audi, 2
20, audi, 5
43, audi, 2
12, seat, 2.5
10, ford, 0.5
30, audi, 6
23, ford, 4
17, seat, 5.5
10, seat, 10
38, audi, 2
40, ford, 9
19, ford, 6.6
49, seat, 18
18, ford, 4 

I would like to group the dataframe by car brands and for each brand to bin the speed observations into ranges (e.g. [0,25] and [25,50]) and then for each brand and bin to calculate the average traction measured, receiving something like:

speed_bin_upper_lim, car_brand, avrg_traction_force_in_speed_bin
25, audi, X1
50, audi, X2
25, ford, X3
50, ford, X4
25, seat, X5
50, seat, X6

How do I do this? It should work for an arbitrary number of unique car_brand classes and the user should only provide the number of speed bins or the range of the bins (e.g. n=3 or [0,25,50]). I suppose pd.groupby and pd.cut will do it, but I didn't find how exactly.


Quang Hoang's answer works very well, and if you want to extend it, because you want to group by one more column, let's say wheel_kind, and your dataframe looks like:

speed_observation,car_brand,wheel_kind,traction_force
10, ford, winter, 2
20, ford, summer, 4
35, seat, summer, 8
50, ford, winter, 16
10, audi, summer, 2
20, audi, summer, 5
43, audi, summer, 2
12, seat, summer, 2.5
10, ford, summer, 0.5
30, audi, summer, 6
23, ford, summer, 4
17, seat, summer, 5.5
10, seat, summer, 10
38, audi, summer, 2
40, ford, summer, 9
19, ford, summer, 6.6
49, seat, summer, 18
18, ford, summer, 4

Then just add the column wheel_kind into the previous solution, more precisely:

(df.groupby(['car_brand', `wheel_kind`, cuts])
   .traction_force.mean()
   .reset_index(name='avg_traction_force')
)

and afterwards don't forget to drop the NaNs, because for ford and audi there are no winter wheels:

df_grp.dropna(inplace=True)
df_grp.reset_index(drop=True, inplace=True) #just to reset the index

Solution

  • You can just cut the speed_observation with the desired bins and group by that:

    cuts = pd.cut(df['speed_observation'], [0,25,50])
    
    (df.groupby(['car_brand', cuts])
       .traction_force.mean()
       .reset_index(name='avg_traction_force')
    )
    

    Output:

      car_brand speed_observation  avg_traction_force
    0      audi           (0, 25]            3.500000
    1      audi          (25, 50]            3.333333
    2      ford           (0, 25]            3.516667
    3      ford          (25, 50]           12.500000
    4      seat           (0, 25]            6.000000
    5      seat          (25, 50]           13.000000