Pandas Calculate RMSE in Date Range Chunks by Year


I have monthly data in a DataFrame and need to calculate the RMSE of one column for each past year against the same months of the current year, over a fixed chunk of months. I cannot figure out how to set up the sequencing by year. For example, I need to calculate the RMSE of the "Variation" column for each window running from exactly month == 5 through month == 2 of the following year, and print the RMSE values by start year. My data looks like this:

                     month  mean_mon_flow  ...  std_anomaly  Variation
date                                       ...                        
1992-04-01 00:00:00      4      12.265100  ...    -1.074586        NaN
1992-05-01 00:00:00      5      12.533220  ...    -1.017388   0.057198
1992-06-01 00:00:00      6      12.491247  ...    -1.117406  -0.100018
1992-07-01 00:00:00      7      12.113165  ...    -1.401221  -0.283815
1992-08-01 00:00:00      8      11.846904  ...    -1.359026   0.042195
1992-09-01 00:00:00      9      11.526178  ...    -0.299250   1.059776
1992-10-01 00:00:00     10      11.555834  ...    -0.628162  -0.328911
1992-11-01 00:00:00     11      11.746104  ...    -1.116374  -0.488213
1992-12-01 00:00:00     12      11.891824  ...    -0.143343   0.973031
1993-01-01 00:00:00      1      11.997252  ...    -0.486450  -0.343107
1993-02-01 00:00:00      2      12.028855  ...    -0.862971  -0.376521
1993-03-01 00:00:00      3      12.063974  ...    -0.596869   0.266102
1993-04-01 00:00:00      4      12.265100  ...    -0.923695  -0.326826
1993-05-01 00:00:00      5      12.533220  ...     0.322987   1.246682
1993-06-01 00:00:00      6      12.491247  ...    -0.478567  -0.801554
1993-07-01 00:00:00      7      12.113165  ...    -0.274119   0.204448
1993-08-01 00:00:00      8      11.846904  ...    -0.707968  -0.433849
1993-09-01 00:00:00      9      11.526178  ...     0.167246   0.875214
1993-10-01 00:00:00     10      11.555834  ...    -0.089410  -0.256656
1993-11-01 00:00:00     11      11.746104  ...    -1.046461  -0.957050
1993-12-01 00:00:00     12      11.891824  ...    -1.293175  -0.246714
1994-01-01 00:00:00      1      11.997252  ...    -1.505133  -0.211959
1994-02-01 00:00:00      2      12.028855  ...    -0.610121   0.895012
1994-03-01 00:00:00      3      12.063974  ...    -0.974184  -0.364063
1994-04-01 00:00:00      4      12.265100  ...    -1.077609  -0.103424

The observed data from the current year looks like this:

                     month  mean_mon_flow  ...  std_anomaly  Variation
date                                       ...                        
2021-05-01 00:00:00      5      12.533220  ...    -0.935899   0.206586
2021-06-01 00:00:00      6      12.491247  ...    -0.647261   0.288638
2021-07-01 00:00:00      7      12.113165  ...    -0.711730  -0.064469
2021-08-01 00:00:00      8      11.846904  ...    -0.482306   0.229424
2021-09-01 00:00:00      9      11.526178  ...    -0.116989   0.365317
2021-10-01 00:00:00     10      11.555834  ...     0.319614   0.436603
2021-11-01 00:00:00     11      11.746104  ...     0.880379   0.560765
2021-12-01 00:00:00     12      11.891824  ...     0.630541  -0.249838
2022-01-01 00:00:00      1      11.997252  ...    -0.151507  -0.782048
2022-02-01 00:00:00      2      12.028855  ...    -0.237398  -0.085891

The result should look something like the table below. I've tried using groupby to calculate the RMSE, but I'm not sure how to give groupby a range of dates.

year   RMSE Variation
1992   number
1993   number
1994   number 
..     ..
2020   number

thank you,


Solution

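  • Start by pre-processing the dataframe of previous years. First, build a year label by subtracting 4 months from each date and taking the year component, so that May through the following February all carry the same start year. Second, drop March and April, which fall outside that window.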

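    import pandas as pd

    # Assumes 'date' is a datetime column; if it is the index, run df_prev = df_prev.reset_index() first.
    # Shift each date back 4 months so May..Feb of one window share the same start-year label.
    df_prev['year'] = (df_prev['date'] - pd.DateOffset(months=4)).dt.year

    # Drop March and April, which are outside the May-to-February window.
    df_prev = df_prev[~df_prev['month'].isin([3, 4])]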
    

    Then reshape df_prev into a table with months as the index and years as the columns, and convert the table for the current year into a series with month as the index.

    df_prev_vari = df_prev.set_index(['month', 'year'])[['Variation']].unstack().droplevel(0, axis=1)
    df_this_vari = df_this.set_index('month')['Variation']
    

    With month as the common index, the current-year series can be subtracted from every year column by index alignment. Squaring the differences, taking the mean of each column, and then taking the square root gives one RMSE per start year.

    (df_prev_vari.sub(df_this_vari, axis=0)**2).mean()**.5
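To try the steps above end to end, here is a minimal sketch on made-up data. The random "Variation" values, the 1992 to 1995 date range, and the result name "RMSE Variation" are assumptions for illustration only; it also assumes 'date' is an ordinary column rather than the index, and uses pd.DateOffset for the 4-month shift.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the historical data (values are random, not real).
rng = np.random.default_rng(0)
dates = pd.date_range("1992-04-01", "1995-04-01", freq="MS")
df_prev = pd.DataFrame({
    "date": dates,
    "month": dates.month,
    "Variation": rng.normal(size=len(dates)),
})

# Synthetic stand-in for the current window (May 2021 through February 2022).
this_dates = pd.date_range("2021-05-01", "2022-02-01", freq="MS")
df_this = pd.DataFrame({
    "date": this_dates,
    "month": this_dates.month,
    "Variation": rng.normal(size=len(this_dates)),
})

# Label each row with the year in which its May-to-February window starts,
# then drop March and April, which belong to no window.
df_prev["year"] = (df_prev["date"] - pd.DateOffset(months=4)).dt.year
df_prev = df_prev[~df_prev["month"].isin([3, 4])]

# Months as rows, start years as columns.
df_prev_vari = (
    df_prev.set_index(["month", "year"])[["Variation"]]
    .unstack()
    .droplevel(0, axis=1)
)
df_this_vari = df_this.set_index("month")["Variation"]

# Align on month, then square, average per year column, and take the root.
rmse = ((df_prev_vari.sub(df_this_vari, axis=0) ** 2).mean() ** 0.5).rename("RMSE Variation")
print(rmse)
```

The printed series is indexed by start year (1992, 1993, 1994 for this synthetic range). Note that mean() skips missing values by default, so a year with an incomplete window would still get an RMSE computed from the months that are present; filter such years out first if that is not what you want.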