I want to bin data and replace each value with an aggregate of its bin (here, the bin mean).
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [1, 2, 3, 4],
})
groups = pd.cut(df['A'], bins=2, labels=False)
group_reps = df.groupby([groups]).agg(A=('A', 'mean'))
# ... some magic happens here to replace values in A by group_reps ...
#
# expected result
# A, B
# 1.5, 1
# 1.5, 2
# 3.5, 3
# 3.5, 4
How can this be implemented efficiently for data whose size is close to the machine's memory?
If you only want to alter one column, you can group that column on its own instead of the whole frame. transform
returns a result aligned with the original index, so it can be assigned straight back:
df['A'] = df['A'].groupby(groups).transform('mean')
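
For reference, here is a minimal end-to-end sketch of the above using the sample frame from the question. The names bin_means and mapped are only illustrative; they show one possible manual equivalent (compute one mean per bin, then map it back onto the rows) so the transform result can be checked against it. I have not benchmarked either route on data near the memory limit, so treat the memory remarks as assumptions rather than measurements.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [1, 2, 3, 4],
})

# Integer bin code for each row, aligned with df's index.
groups = pd.cut(df['A'], bins=2, labels=False)

# Illustrative manual route: one mean per bin, then look each row's
# bin code up in that small table.
bin_means = df['A'].groupby(groups).mean()
mapped = groups.map(bin_means)

# Route from the answer: transform broadcasts each bin's mean back to
# the rows of that bin, keeping the original index.
df['A'] = df['A'].groupby(groups).transform('mean')

# Both routes should agree row for row.
same = (df['A'] == mapped).all()
```

Only column A is grouped here, so B is never aggregated or copied by the groupby. Both routes still build one new Series as long as the frame, which seems hard to avoid if every row needs a replacement value.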