Search code examples
pythonpandasdatetimecalculated-columns

RMSE between all value with same hours with pandas


I have two dataframe: the first represents the output of a model simulation and the second the real value. I would like to compute the RMSE between all the value with the same hours. Basically I should compute 24 RMSE value, one for each hour.

These are the first columns of my dataframes:

date;model
2017-01-01 00:00:00;53
2017-01-01 01:00:00;52
2017-01-01 02:00:00;51
2017-01-01 03:00:00;47.27
2017-01-01 04:00:00;45.49
2017-01-01 05:00:00;45.69
2017-01-01 06:00:00;48.07
2017-01-01 07:00:00;45.67
2017-01-01 08:00:00;45.48
2017-01-01 09:00:00;42.06
2017-01-01 10:00:00;46.86
2017-01-01 11:00:00;48.02
2017-01-01 12:00:00;49.57
2017-01-01 13:00:00;48.69
2017-01-01 14:00:00;46.91
2017-01-01 15:00:00;49.43
2017-01-01 16:00:00;50.45
2017-01-01 17:00:00;53.3
2017-01-01 18:00:00;59.07
2017-01-01 19:00:00;61.71
2017-01-01 20:00:00;56.26
2017-01-01 21:00:00;55
2017-01-01 22:00:00;54
2017-01-01 23:00:00;52
2017-01-02 00:00:00;53

and

date;real
2017-01-01 00:00:00;55
2017-01-01 01:00:00;55
2017-01-01 02:00:00;55
2017-01-01 03:00:00;48.27
2017-01-01 04:00:00;48.49
2017-01-01 05:00:00;48.69
2017-01-01 06:00:00;49.07
2017-01-01 07:00:00;49.67
2017-01-01 08:00:00;49.48
2017-01-01 09:00:00;50.06
2017-01-01 10:00:00;50.86
2017-01-01 11:00:00;50.02
2017-01-01 12:00:00;33.57
2017-01-01 13:00:00;33.69
2017-01-01 14:00:00;33.91
2017-01-01 15:00:00;33.43
2017-01-01 16:00:00;33.45
2017-01-01 17:00:00;33.3
2017-01-01 18:00:00;33.07
2017-01-01 19:00:00;33.71
2017-01-01 20:00:00;33.26
2017-01-01 21:00:00;33
2017-01-01 22:00:00;33
2017-01-01 23:00:00;33
2017-01-02 00:00:00;33

due to the fact that I am considering one year, I have to consider 365 value for each RMSE computation.

Up to now, I able only to read the dataframes. One option could be to set-up a cycle between 1-24 and to try do create 24 new dataframes by means of dfr[dfr.index.hour == i-th hours].

Do you have some more elegant and efficient solution?

Thanks


Solution

  • RMSE depends on the pairing order so you should join the model to the real data first, then group by hour and calculate your RMSE:

    def rmse(group):
        if len(group) == 0:
            return np.nan
        
        s = (group['model'] - group['real']).pow(2).sum()
        return np.sqrt(s / len(group))
        
    result = (
        df1.merge(df2, on='date')
            .assign(hour=lambda x: x['date'].dt.hour)
            .groupby('hour')
            .apply(rmse)
    )
    

    Result:

    hour
    0     14.21267
    1      3.00000
    2      4.00000
    3      1.00000
    4      3.00000
    5      3.00000
    6      1.00000
    7      4.00000
    8      4.00000
    9      8.00000
    10     4.00000
    11     2.00000
    12    16.00000
    13    15.00000
    14    13.00000
    15    16.00000
    16    17.00000
    17    20.00000
    18    26.00000
    19    28.00000
    20    23.00000
    21    22.00000
    22    21.00000
    23    19.00000
    dtype: float64
    

    Explanation

    Here what the code does:

    • merge: combine the two data frames together based on the date index
    • assign: create a new column hour, extracted from the date index
    • groupby: group rows based on their hour values

    apply allows you to write a custom aggregator. All the rows with hour = 0 will be sent into the rmse function (our custom function), all the rows with hour = 1 will be sent next. As an illustration:

    date                 hour  model  real
    2017-01-01 00:00:00  0     ...    ...
    2017-01-02 00:00:00  0     ...    ...
    2017-01-03 00:00:00  0     ...    ...
    2017-01-04 00:00:00  0     ...    ...
    --------------------------------------
    2017-01-01 01:00:00  1     ...    ...
    2017-01-02 01:00:00  1     ...    ...
    2017-01-03 01:00:00  1     ...    ...
    2017-01-04 01:00:00  1     ...    ...
    --------------------------------------
    2017-01-01 02:00:00  2     ...    ...
    2017-01-02 02:00:00  2     ...    ...
    2017-01-03 02:00:00  2     ...    ...
    2017-01-04 02:00:00  2     ...    ...
    --------------------------------------
    2017-01-01 03:00:00  3     ...    ...
    2017-01-02 03:00:00  3     ...    ...
    2017-01-03 03:00:00  3     ...    ...
    2017-01-04 03:00:00  3     ...    ...
    

    Each chunk is then sent to our custom function: rmse(group=<a chunk>). Within the function, we reduce that chunk down into a single number: its RMSE. That's how you get the 24 RMSE numbers back as a result.