Tags: python, pandas, pandas-groupby, forecasting, pandas-apply

Pandas: apply a custom WMAPE aggregation function to multiple columns without a for-loop?


Objective: group a pandas DataFrame and apply a custom WMAPE (Weighted Mean Absolute Percent Error) function to multiple forecast columns against one actual-data column, without a for-loop. I know a for-loop plus merges of the output dataframes will do the trick, but I want to do this efficiently.

Have: a WMAPE function, and successful use of it on one forecast column of the dataframe. There is one column of actual data and a variable number of forecast columns.

Input Data: Pandas DataFrame with several categorical columns (City, Person, DT, HOUR), one actual data column (Actual), and four forecast columns (Forecast_1 ... Forecast_4). See link for csv: https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1

Need: the WMAPE function applied during a groupby to multiple columns, with a list of forecast columns passed in on the groupby line.

Output Desired: An output dataframe with the categorical group columns and one WMAPE column per forecast. Labeling is preferred but not needed (output image below).

Successful Code so far: Two WMAPE functions: one to take two series in & output a single float value (wmape), and one structured for use in a groupby (wmape_gr):

def wmape(actual, forecast):
    # take two series (actual, forecast) and calculate a single wmape value from them

    # make a series called mape
    se_mape = abs(actual-forecast)/actual

    # get a float of the sum of the actual
    ft_actual_sum = actual.sum()

    # get a series of the multiple of the actual & the mape
    se_actual_prod_mape = actual * se_mape

    # summate the prod of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()

    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

    # return a float
    return ft_wmape_forecast

def wmape_gr(df_in, st_actual, st_forecast):
    # take a dataframe plus the names of its actual and forecast columns
    # and calculate a single wmape value from them

    # make a series called mape
    se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]

    # get a float of the sum of the actual
    ft_actual_sum = df_in[st_actual].sum()

    # get a series of the multiple of the actual & the mape
    se_actual_prod_mape = df_in[st_actual] * se_mape

    # summate the prod of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()

    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

    # return a float
    return ft_wmape_forecast

import pandas as pd

# read in data directly from Dropbox
df = pd.read_csv('https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1', sep=",", header=0)

# grouping with 3 columns. wmape_gr uses the Actual column, and Forecast_1 as inputs
df_gr = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')

Output looks like (first two rows):

[Image: first two rows of df_gr]

Desired output would have all forecasts in one shot (dummy data for Forecast_2 ... Forecast_4). I can already do this with a for-loop. I just want to do it within the groupby. I want to call a wmape function four times. I would appreciate any assistance.
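
For reference, the for-loop baseline mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: `df_demo` is a small synthetic stand-in for the CSV (the values are made up), it groups on `City` only, and the compact `wmape` here is meant to compute the same quantity as the function shown earlier.

```python
import pandas as pd

def wmape(actual, forecast):
    # Same calculation as the question's wmape function, in compact form.
    se_mape = abs(actual - forecast) / actual
    return (actual * se_mape).sum() / actual.sum()

# Synthetic stand-in for the CSV; values are made up for illustration.
df_demo = pd.DataFrame({
    'City': ['A', 'A', 'B', 'B'],
    'Actual': [10.0, 20.0, 30.0, 40.0],
    'Forecast_1': [12.0, 18.0, 33.0, 40.0],
    'Forecast_2': [10.0, 25.0, 30.0, 36.0],
})
forecast_cols = ['Forecast_1', 'Forecast_2']

# One pass over the groups per forecast column, then glue the results together.
pieces = []
for col in forecast_cols:
    rows = {
        city: wmape(grp['Actual'], grp[col])
        for city, grp in df_demo.groupby('City')
    }
    pieces.append(pd.Series(rows, name=col + '_wmape'))

# Combine the per-forecast Series column-wise and restore City as a column.
df_loop = pd.concat(pieces, axis=1).rename_axis('City').reset_index()
print(df_loop)
```

Each forecast column triggers its own pass over the groups, which is the repetition the question wants to avoid.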


Solution

  • If you modify wmape to work with arrays using broadcasting, then you can do it in one shot:

    import numpy as np

    def wmape(actual, forecast):
        # Take a series (actual) and a dataframe (forecast) and calculate wmape
        # for each forecast. Returns a dict with one entry per forecast column.

        # Convert to numpy arrays for broadcasting:
        # forecast has shape (n, num_forecasts), actual has shape (n, 1)
        forecast = np.array(forecast.values)
        actual = np.array(actual.values).reshape((-1, 1))

        # Make an array of mape (same shape as forecast)
        se_mape = abs(actual - forecast) / actual

        # Calculate sum of actual values
        ft_actual_sum = actual.sum(axis=0)

        # Multiply the actual values by the mape
        se_actual_prod_mape = actual * se_mape

        # Take the sum of the product of actual values and mape
        # Make sure to sum down the rows (1 for each column)
        ft_actual_prod_mape_sum = se_actual_prod_mape.sum(axis=0)

        # Calculate the wmape for each forecast and return as a dictionary
        ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
        return {f'Forecast_{i+1}_wmape': value for i, value in enumerate(ft_wmape_forecast)}
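
To make the broadcasting step concrete, here is a minimal sketch on made-up numbers (three rows, two forecasts; none of these values come from the CSV). It follows the same steps as the function above and shows the array shapes involved.

```python
import numpy as np

# Synthetic values for illustration only.
actual = np.array([10.0, 20.0, 30.0]).reshape((-1, 1))   # shape (3, 1)
forecast = np.array([[12.0, 10.0],
                     [18.0, 25.0],
                     [33.0, 30.0]])                      # shape (3, 2)

# (3, 1) against (3, 2): the single actual column is stretched across
# both forecast columns, so every forecast is compared with the same actuals.
se_mape = np.abs(actual - forecast) / actual             # shape (3, 2)

# Weight each row's mape by its actual value, then sum down the rows
# to get one number per forecast column.
weighted_sum = (actual * se_mape).sum(axis=0)            # shape (2,)
wmape_per_forecast = weighted_sum / actual.sum(axis=0)   # shape (2,)

print(se_mape.shape, wmape_per_forecast.shape)
print(wmape_per_forecast)
```

Without the reshape to a column vector, a length-3 `actual` would not line up against the rows of a (3, 2) `forecast`, which is why the function reshapes it first.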
    

    Then use apply on the proper columns:

    # Group the dataframe and apply the function to appropriate columns
    new_df = df.groupby(['City', 'Person', 'DT']).apply(lambda x: wmape(x['Actual'], 
                                            x[[c for c in x if 'Forecast' in c]])).\
                to_frame().reset_index()
    

    This results in a dataframe with a single column of dictionaries.

    [Image: intermediate results]

    The single column can be converted to multiple columns for the correct format:

    # Convert the dictionaries in the single column into 4 columns with proper names
    # and concatenate column-wise
    df_grp = pd.concat([new_df.drop(columns=[0]), 
                        pd.DataFrame(list(new_df[0].values))], axis=1)
    

    Result:

    [Image: result of operations]
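
As a possible variation on the approach above (not part of the original answer), the function can return a labeled `pd.Series` instead of a dictionary. `groupby(...).apply` then expands the labels into columns directly, so the dictionary-to-columns step is not needed. The sketch below assumes a hypothetical helper name `wmape_series` and a small synthetic `df_demo` grouped on `City` only. The real data would group on `['City', 'Person', 'DT']`.

```python
import numpy as np
import pandas as pd

def wmape_series(group, st_actual, forecast_cols):
    # Column vector of actuals, shape (n, 1), so it broadcasts across forecasts.
    actual = group[st_actual].to_numpy().reshape((-1, 1))
    forecast = group[forecast_cols].to_numpy()           # shape (n, k)
    se_mape = np.abs(actual - forecast) / actual
    values = (actual * se_mape).sum(axis=0) / actual.sum(axis=0)
    # Returning a labeled Series lets apply build one column per label.
    return pd.Series(values, index=[c + '_wmape' for c in forecast_cols])

# Synthetic stand-in for the CSV; values are made up for illustration.
df_demo = pd.DataFrame({
    'City': ['A', 'A', 'B', 'B'],
    'Actual': [10.0, 20.0, 30.0, 40.0],
    'Forecast_1': [12.0, 18.0, 33.0, 40.0],
    'Forecast_2': [10.0, 25.0, 30.0, 36.0],
})
forecast_cols = [c for c in df_demo.columns if 'Forecast' in c]

# Select only the columns the function needs, so the grouping column
# itself is not passed into each group.
df_grp = (
    df_demo.groupby('City')[['Actual'] + forecast_cols]
    .apply(wmape_series, 'Actual', forecast_cols)
    .reset_index()
)
print(df_grp)
```

Naming the output columns after the input columns, rather than by position, also keeps the labels correct if the forecast columns are not named `Forecast_1 ... Forecast_4` in order.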