Search code examples
pythonpandasdataframebinning

python : aggregate dataframe values by bin


I have a dataset with that looks like that :

|col A|col B|
   1     20
   3    123
   7     2
     ...

I would like to compute the mean value of col B over each bin of col A.

This would result in a new dataframe containing only one row per bin with :

 | mid value of the col A bin | avg value of col B over that bin |

Solution

  • As you haven't specified the number of bins and their properties, let me illustrate what you may do with pandas.cut to the example data you provided:

    import pandas as pd
    
    # reproduce your example data
    df = pd.DataFrame({'col A': [1, 3, 7],
                       'col B': [20, 123, 2]})
    
    # suggest only 2 bins would be proper for 3 rows of data
    df['col A bins'] = pd.cut(df['col A'], 
                              bins=2)
    

    Output:

    # bins may be labeled as you like, not as automatic interval
        col A   col B   col A bins
    0   1       20      (0.994, 4.0]
    1   3       123     (0.994, 4.0]
    2   7       2       (4.0, 7.0]
    

    Then we may group the initial columns by the new bins, with col A aggregated to median (as from your new column names) and col B to mean, making it look as your expected result by renaming and dropping columns:

    df.groupby('col A bins').agg({'col A': 'median',
                                  'col B': 'mean'}
                           ).rename(columns={'col A':'mid value of the col A bin',
                                             'col B':'avg value of col B over that bin'}
                           ).reset_index(drop=True)
    

    Output:

        mid value of the col A bin  avg value of col B over that bin
    0   2.0                         71.5
    1   7.0                         2.0