python pandas datetime calculated-columns

RMSE between all value with same hours with pandas

I have two dataframe: the first represents the output of a model simulation and the second the real value. I would like to compute the RMSE between all the value with the same hours. Basically I should compute 24 RMSE value, one for each hour.

These are the first columns of my dataframes:

date;model
2017-01-01 00:00:00;53
2017-01-01 01:00:00;52
2017-01-01 02:00:00;51
2017-01-01 03:00:00;47.27
2017-01-01 04:00:00;45.49
2017-01-01 05:00:00;45.69
2017-01-01 06:00:00;48.07
2017-01-01 07:00:00;45.67
2017-01-01 08:00:00;45.48
2017-01-01 09:00:00;42.06
2017-01-01 10:00:00;46.86
2017-01-01 11:00:00;48.02
2017-01-01 12:00:00;49.57
2017-01-01 13:00:00;48.69
2017-01-01 14:00:00;46.91
2017-01-01 15:00:00;49.43
2017-01-01 16:00:00;50.45
2017-01-01 17:00:00;53.3
2017-01-01 18:00:00;59.07
2017-01-01 19:00:00;61.71
2017-01-01 20:00:00;56.26
2017-01-01 21:00:00;55
2017-01-01 22:00:00;54
2017-01-01 23:00:00;52
2017-01-02 00:00:00;53

and

date;real
2017-01-01 00:00:00;55
2017-01-01 01:00:00;55
2017-01-01 02:00:00;55
2017-01-01 03:00:00;48.27
2017-01-01 04:00:00;48.49
2017-01-01 05:00:00;48.69
2017-01-01 06:00:00;49.07
2017-01-01 07:00:00;49.67
2017-01-01 08:00:00;49.48
2017-01-01 09:00:00;50.06
2017-01-01 10:00:00;50.86
2017-01-01 11:00:00;50.02
2017-01-01 12:00:00;33.57
2017-01-01 13:00:00;33.69
2017-01-01 14:00:00;33.91
2017-01-01 15:00:00;33.43
2017-01-01 16:00:00;33.45
2017-01-01 17:00:00;33.3
2017-01-01 18:00:00;33.07
2017-01-01 19:00:00;33.71
2017-01-01 20:00:00;33.26
2017-01-01 21:00:00;33
2017-01-01 22:00:00;33
2017-01-01 23:00:00;33
2017-01-02 00:00:00;33

due to the fact that I am considering one year, I have to consider 365 value for each RMSE computation.

Up to now, I able only to read the dataframes. One option could be to set-up a cycle between 1-24 and to try do create 24 new dataframes by means of dfr[dfr.index.hour == i-th hours].

Do you have some more elegant and efficient solution?

Thanks

Solution

RMSE depends on the pairing order so you should join the model to the real data first, then group by hour and calculate your RMSE:

def rmse(group):
    if len(group) == 0:
        return np.nan
    
    s = (group['model'] - group['real']).pow(2).sum()
    return np.sqrt(s / len(group))
    
result = (
    df1.merge(df2, on='date')
        .assign(hour=lambda x: x['date'].dt.hour)
        .groupby('hour')
        .apply(rmse)
)

Result:

hour
0     14.21267
1      3.00000
2      4.00000
3      1.00000
4      3.00000
5      3.00000
6      1.00000
7      4.00000
8      4.00000
9      8.00000
10     4.00000
11     2.00000
12    16.00000
13    15.00000
14    13.00000
15    16.00000
16    17.00000
17    20.00000
18    26.00000
19    28.00000
20    23.00000
21    22.00000
22    21.00000
23    19.00000
dtype: float64

Explanation

Here what the code does:

merge: combine the two data frames together based on the date index
assign: create a new column hour, extracted from the date index
groupby: group rows based on their hour values

apply allows you to write a custom aggregator. All the rows with hour = 0 will be sent into the rmse function (our custom function), all the rows with hour = 1 will be sent next. As an illustration:

date                 hour  model  real
2017-01-01 00:00:00  0     ...    ...
2017-01-02 00:00:00  0     ...    ...
2017-01-03 00:00:00  0     ...    ...
2017-01-04 00:00:00  0     ...    ...
--------------------------------------
2017-01-01 01:00:00  1     ...    ...
2017-01-02 01:00:00  1     ...    ...
2017-01-03 01:00:00  1     ...    ...
2017-01-04 01:00:00  1     ...    ...
--------------------------------------
2017-01-01 02:00:00  2     ...    ...
2017-01-02 02:00:00  2     ...    ...
2017-01-03 02:00:00  2     ...    ...
2017-01-04 02:00:00  2     ...    ...
--------------------------------------
2017-01-01 03:00:00  3     ...    ...
2017-01-02 03:00:00  3     ...    ...
2017-01-03 03:00:00  3     ...    ...
2017-01-04 03:00:00  3     ...    ...

Each chunk is then sent to our custom function: rmse(group=<a chunk>). Within the function, we reduce that chunk down into a single number: its RMSE. That's how you get the 24 RMSE numbers back as a result.