Search code examples
pandasgroupingaggregation

How to customize column names while grouping and aggregating?


I have a dataframe which has the following columns: region_id, name, parent, parent_name, t2m, d2m, and tp.

I want to group and aggregate column values in a specific way. To enable that, I have defined the following lists:

w_params = ['t2m', 't2m', 't2m', 'd2m', 'tp']
operation = ['max', 'min', 'mean', 'mean', 'sum']
common_cols = ['region_id', 'name', 'parent', 'parent_name']

I have written the function agg_daily to group column values by date and region_id and aggregate.

def agg_daily(df, common_cols, w_params, operation):
    """
    Aggregate the data for each day.

    Parameters
    ----------
    df : pandas dataframe
        Dataframe containing daily data.

    Returns
    -------
    agg_daily_df : pandas dataframe
        Dataframe containing aggregated data for each day.

    """
    agg_daily_df = df.groupby(['date', 'region_id']).agg(
        name=('name', 'first'),
        parent=('parent', 'first'),
        parent_name=('parent_name', 'first'),
        t2m_max=('t2m', 'max'),
        t2m_min=('t2m', 'min'),
        t2m_mean=('t2m', 'mean'),
        d2m=('d2m', 'mean'),
        tp=('tp', 'sum')
    ).reset_index()
    agg_daily_df = agg_daily_df.sort_values(['region_id', 'date'], ascending=[True, True]).reset_index(drop=True)
    return agg_daily_df

However, notice inside agg_daily that the arguments within agg, e.g. t2m_max, t2m_min, t2m_mean are hard coded. Instead, I want to pass common_cols, w_params, operation as arguments to agg_daily, avoid the hard coding, and yet get the function agg_daily perform the desired operation.

Note that for columns belonging to common_cols, I do not wish to create a new column name in the final output. However, for the columns belonging to w_params, I want to create a column corresponding to the operation being performed.

Can anyone help me get a customizable function?


Solution

  • unpack a dictionary created from the pairing of w_params and operation within the named aggregation:

    def agg_daily(df, common_cols, w_params, operation):
        mapped = zip(w_params, operation)
        mapped = {f"{col}_{func}": (col, func) for col, func in mapped}
        outcome = df.groupby(common_cols, as_index=False).agg(**mapped)
        return outcome
    

    Application:

    data = {'model': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710'},
     'mpg': {0: 21.0, 1: 21.0, 2: 22.8},
     'cyl': {0: 6, 1: 6, 2: 4},
     'disp': {0: 160.0, 1: 160.0, 2: 108.0},
     'hp': {0: 110, 1: 110, 2: 93},
     'drat': {0: 3.9, 1: 3.9, 2: 3.85},
     'wt': {0: 2.62, 1: 2.875, 2: 2.32},
     'qsec': {0: 16.46, 1: 17.02, 2: 18.61},
     'vs': {0: 0, 1: 0, 2: 1},
     'am': {0: 1, 1: 1, 2: 1},
     'gear': {0: 4, 1: 4, 2: 4},
     'carb': {0: 4, 1: 4, 2: 1}}
    
    mtcars = pd.DataFrame(data)
    agg_daily(df=mtcars, 
              common_cols='cyl', 
              w_params=['disp','hp','drat'], 
              operation=['min','max','min'])
       cyl  disp_min  hp_max  drat_min
    0    4     108.0      93      3.85
    1    6     160.0     110      3.90
    

    ideally, you'll add some checks - w_params should be same length as operation, the entries in operation should be strings (if they are not you have to consider how to grab the names - .__name__() possibly), ...