I have data which contains position in one column and values in other columns. I wanted to plot a line graph by binning the rows. there are more than 50,000 rows, so i want to bin data in to 50 bins and calculate average for values in the columns.
Here is my input sample data.
position sample_1 sample_2 sample_3 sample_4
1001 21 43 28 35
1002 22 42 29 38
1003 23 46 23 42
1004 26 43 26 46
1005 31 31 31 43
1006 26 35 26 31
1007 18 36 24 35
1008 19 37 23 39
1009 21 34 22 23
1010 31 24 21 26
1011 32 23 20 31
1012 34 36 22 26
1013 25 43 23 25
1014 27 42 26 27
1015 26 46 26 26
1016 35 47 31 28
The output i am looking for is after binning sample data in to 4 columns and data is averaged for other samples.
The output i am expecting from sample data is
Bin sample_1 sample_2 sample_3 sample_4
1001-1004 23 43.5 26.5 40.25
1005-1008 23.5 34.75 26 37
1009-1012 29.5 29.25 21.25 26.5
1013-1016 28.25 44.5 26.5 26.5
You can use cut
to define the bins, then groupby.sum
:
n_bins = 4
out = (df
.drop(columns='position')
.groupby(pd.cut(df['position'], bins=n_bins))
.mean().reset_index()
)
Output:
position sample_1 sample_2 sample_3 sample_4
0 (1000.985, 1004.75] 23.00 43.50 26.50 40.25
1 (1004.75, 1008.5] 23.50 34.75 26.00 37.00
2 (1008.5, 1012.25] 29.50 29.25 21.25 26.50
3 (1012.25, 1016.0] 28.25 44.50 26.50 26.50
If you only have integers and want to round the bins:
n_bins = 4
bins = pd.IntervalIndex(pd.cut(df['position'], bins=n_bins))
bins = (np.ceil(bins.left).astype('Int64').astype(str)
+'−'+
np.floor(bins.right).astype('Int64').astype(str)
)
out = (df
.drop(columns='position')
.groupby(bins)
.mean().reset_index()
)
Output:
index sample_1 sample_2 sample_3 sample_4
0 1001−1004 23.00 43.50 26.50 40.25
1 1005−1008 23.50 34.75 26.00 37.00
2 1009−1012 29.50 29.25 21.25 26.50
3 1013−1016 28.25 44.50 26.50 26.50