Search code examples
pythonpandasdataframenumpy

Bin data frame and calculate average


I have data which contains position in one column and values in other columns. I wanted to plot a line graph by binning the rows. there are more than 50,000 rows, so i want to bin data in to 50 bins and calculate average for values in the columns.

Here is my input sample data.

position    sample_1    sample_2    sample_3    sample_4
1001    21  43  28  35
1002    22  42  29  38
1003    23  46  23  42
1004    26  43  26  46
1005    31  31  31  43
1006    26  35  26  31
1007    18  36  24  35
1008    19  37  23  39
1009    21  34  22  23
1010    31  24  21  26
1011    32  23  20  31
1012    34  36  22  26
1013    25  43  23  25
1014    27  42  26  27
1015    26  46  26  26
1016    35  47  31  28

The output i am looking for is after binning sample data in to 4 columns and data is averaged for other samples.

The output i am expecting from sample data is

Bin sample_1    sample_2    sample_3    sample_4
1001-1004   23  43.5    26.5    40.25
1005-1008   23.5    34.75   26  37
1009-1012   29.5    29.25   21.25   26.5
1013-1016   28.25   44.5    26.5    26.5

Solution

  • You can use cut to define the bins, then groupby.sum:

    n_bins = 4
    
    out = (df
     .drop(columns='position')
     .groupby(pd.cut(df['position'], bins=n_bins))
     .mean().reset_index()
    )
    

    Output:

                  position  sample_1  sample_2  sample_3  sample_4
    0  (1000.985, 1004.75]     23.00     43.50     26.50     40.25
    1    (1004.75, 1008.5]     23.50     34.75     26.00     37.00
    2    (1008.5, 1012.25]     29.50     29.25     21.25     26.50
    3    (1012.25, 1016.0]     28.25     44.50     26.50     26.50
    

    If you only have integers and want to round the bins:

    n_bins = 4
    
    bins = pd.IntervalIndex(pd.cut(df['position'], bins=n_bins))
    bins = (np.ceil(bins.left).astype('Int64').astype(str)
            +'−'+
            np.floor(bins.right).astype('Int64').astype(str)
           )
    
    out = (df
     .drop(columns='position')
     .groupby(bins)
     .mean().reset_index()
    )
    

    Output:

           index  sample_1  sample_2  sample_3  sample_4
    0  1001−1004     23.00     43.50     26.50     40.25
    1  1005−1008     23.50     34.75     26.00     37.00
    2  1009−1012     29.50     29.25     21.25     26.50
    3  1013−1016     28.25     44.50     26.50     26.50