I have a dataframe which has the following columns:
region_id
, name
, parent
, parent_name
, t2m
, d2m
, and tp
.
I want to group and aggregate column values in a specific way. To enable that, I have defined the following lists:
w_params = ['t2m', 't2m', 't2m', 'd2m', 'tp']
operation = ['max', 'min', 'mean', 'mean', 'sum']
common_cols = ['region_id', 'name', 'parent', 'parent_name']
I have written the function agg_daily
to group column values by date
and region_id
and aggregate.
def agg_daily(df, common_cols, w_params, operation):
"""
Aggregate the data for each day.
Parameters
----------
df : pandas dataframe
Dataframe containing daily data.
Returns
-------
agg_daily_df : pandas dataframe
Dataframe containing aggregated data for each day.
"""
agg_daily_df = df.groupby(['date', 'region_id']).agg(
name=('name', 'first'),
parent=('parent', 'first'),
parent_name=('parent_name', 'first'),
t2m_max=('t2m', 'max'),
t2m_min=('t2m', 'min'),
t2m_mean=('t2m', 'mean'),
d2m=('d2m', 'mean'),
tp=('tp', 'sum')
).reset_index()
agg_daily_df = agg_daily_df.sort_values(['region_id', 'date'], ascending=[True, True]).reset_index(drop=True)
return agg_daily_df
However, notice inside agg_daily
that the arguments within agg
, e.g. t2m_max
, t2m_min
, t2m_mean
are hard coded. Instead, I want to pass common_cols
, w_params
, operation
as arguments to agg_daily
, avoid the hard coding, and yet get the function agg_daily
perform the desired operation.
Note that for columns belonging to common_cols
, I do not wish to create a new column name in the final output. However, for the columns belonging to w_params
, I want to create a column corresponding to the operation
being performed.
Can anyone help me get a customizable function?
unpack a dictionary created from the pairing of w_params and operation within the named aggregation:
def agg_daily(df, common_cols, w_params, operation):
mapped = zip(w_params, operation)
mapped = {f"{col}_{func}": (col, func) for col, func in mapped}
outcome = df.groupby(common_cols, as_index=False).agg(**mapped)
return outcome
Application:
data = {'model': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710'},
'mpg': {0: 21.0, 1: 21.0, 2: 22.8},
'cyl': {0: 6, 1: 6, 2: 4},
'disp': {0: 160.0, 1: 160.0, 2: 108.0},
'hp': {0: 110, 1: 110, 2: 93},
'drat': {0: 3.9, 1: 3.9, 2: 3.85},
'wt': {0: 2.62, 1: 2.875, 2: 2.32},
'qsec': {0: 16.46, 1: 17.02, 2: 18.61},
'vs': {0: 0, 1: 0, 2: 1},
'am': {0: 1, 1: 1, 2: 1},
'gear': {0: 4, 1: 4, 2: 4},
'carb': {0: 4, 1: 4, 2: 1}}
mtcars = pd.DataFrame(data)
agg_daily(df=mtcars,
common_cols='cyl',
w_params=['disp','hp','drat'],
operation=['min','max','min'])
cyl disp_min hp_max drat_min
0 4 108.0 93 3.85
1 6 160.0 110 3.90
ideally, you'll add some checks - w_params should be same length as operation, the entries in operation should be strings (if they are not you have to consider how to grab the names - .__name__()
possibly), ...