I have two dataframes: the first holds the output of a model simulation and the second the real values. I would like to compute the RMSE between all the values that share the same hour. Basically, I need to compute 24 RMSE values, one for each hour.
These are the first rows of my dataframes:
date;model
2017-01-01 00:00:00;53
2017-01-01 01:00:00;52
2017-01-01 02:00:00;51
2017-01-01 03:00:00;47.27
2017-01-01 04:00:00;45.49
2017-01-01 05:00:00;45.69
2017-01-01 06:00:00;48.07
2017-01-01 07:00:00;45.67
2017-01-01 08:00:00;45.48
2017-01-01 09:00:00;42.06
2017-01-01 10:00:00;46.86
2017-01-01 11:00:00;48.02
2017-01-01 12:00:00;49.57
2017-01-01 13:00:00;48.69
2017-01-01 14:00:00;46.91
2017-01-01 15:00:00;49.43
2017-01-01 16:00:00;50.45
2017-01-01 17:00:00;53.3
2017-01-01 18:00:00;59.07
2017-01-01 19:00:00;61.71
2017-01-01 20:00:00;56.26
2017-01-01 21:00:00;55
2017-01-01 22:00:00;54
2017-01-01 23:00:00;52
2017-01-02 00:00:00;53
and
date;real
2017-01-01 00:00:00;55
2017-01-01 01:00:00;55
2017-01-01 02:00:00;55
2017-01-01 03:00:00;48.27
2017-01-01 04:00:00;48.49
2017-01-01 05:00:00;48.69
2017-01-01 06:00:00;49.07
2017-01-01 07:00:00;49.67
2017-01-01 08:00:00;49.48
2017-01-01 09:00:00;50.06
2017-01-01 10:00:00;50.86
2017-01-01 11:00:00;50.02
2017-01-01 12:00:00;33.57
2017-01-01 13:00:00;33.69
2017-01-01 14:00:00;33.91
2017-01-01 15:00:00;33.43
2017-01-01 16:00:00;33.45
2017-01-01 17:00:00;33.3
2017-01-01 18:00:00;33.07
2017-01-01 19:00:00;33.71
2017-01-01 20:00:00;33.26
2017-01-01 21:00:00;33
2017-01-01 22:00:00;33
2017-01-01 23:00:00;33
2017-01-02 00:00:00;33
Since I am considering one year, each RMSE computation involves 365 values.
So far I have only managed to read the dataframes. One option would be to loop over the 24 hours and create 24 new dataframes with dfr[dfr.index.hour == i], where i is the i-th hour.
Is there a more elegant and efficient solution?
Thanks
RMSE depends on the pairing order, so you should join the model data to the real data first, then group by hour and calculate your RMSE:
import numpy as np

def rmse(group):
    # An empty group has no defined RMSE
    if len(group) == 0:
        return np.nan
    # Sum of squared errors between the model and real values
    s = (group['model'] - group['real']).pow(2).sum()
    return np.sqrt(s / len(group))

result = (
    df1.merge(df2, on='date')
       .assign(hour=lambda x: x['date'].dt.hour)
       .groupby('hour')
       .apply(rmse)
)
Result:
hour
0 14.21267
1 3.00000
2 4.00000
3 1.00000
4 3.00000
5 3.00000
6 1.00000
7 4.00000
8 4.00000
9 8.00000
10 4.00000
11 2.00000
12 16.00000
13 15.00000
14 13.00000
15 16.00000
16 17.00000
17 20.00000
18 26.00000
19 28.00000
20 23.00000
21 22.00000
22 21.00000
23 19.00000
dtype: float64
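For anyone who wants to reproduce this from scratch, here is a minimal self-contained sketch. It assumes the data arrives as semicolon-separated text as in the question; the three-row samples and the names model_csv, real_csv and merged are made up for illustration, and the grouping key is passed as a Series rather than added as a column.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical samples in the question's semicolon-separated layout
model_csv = """date;model
2017-01-01 00:00:00;53
2017-01-01 01:00:00;52
2017-01-02 00:00:00;53
"""
real_csv = """date;real
2017-01-01 00:00:00;55
2017-01-01 01:00:00;55
2017-01-02 00:00:00;33
"""

# parse_dates makes 'date' a datetime column, so .dt.hour is available
df1 = pd.read_csv(io.StringIO(model_csv), sep=";", parse_dates=["date"])
df2 = pd.read_csv(io.StringIO(real_csv), sep=";", parse_dates=["date"])


def rmse(group):
    # Root of the mean squared error within one hour group
    err = group["model"] - group["real"]
    return np.sqrt((err ** 2).mean())


# Pair model and real values by timestamp, then group by hour of day
merged = df1.merge(df2, on="date")
hour = merged["date"].dt.hour.rename("hour")
result = merged.groupby(hour)[["model", "real"]].apply(rmse)
print(result)
```

With the real files, the two StringIO objects would simply be replaced by the file paths.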
Here is what the code does:
merge: combines the two data frames based on the date column.
assign: creates a new column, hour, extracted from the date column.
groupby: groups rows by their hour values.
apply: lets you write a custom aggregator. All the rows with hour = 0 are sent to the rmse function (our custom function), then all the rows with hour = 1, and so on. As an illustration:
date hour model real
2017-01-01 00:00:00 0 ... ...
2017-01-02 00:00:00 0 ... ...
2017-01-03 00:00:00 0 ... ...
2017-01-04 00:00:00 0 ... ...
--------------------------------------
2017-01-01 01:00:00 1 ... ...
2017-01-02 01:00:00 1 ... ...
2017-01-03 01:00:00 1 ... ...
2017-01-04 01:00:00 1 ... ...
--------------------------------------
2017-01-01 02:00:00 2 ... ...
2017-01-02 02:00:00 2 ... ...
2017-01-03 02:00:00 2 ... ...
2017-01-04 02:00:00 2 ... ...
--------------------------------------
2017-01-01 03:00:00 3 ... ...
2017-01-02 03:00:00 3 ... ...
2017-01-03 03:00:00 3 ... ...
2017-01-04 03:00:00 3 ... ...
Each chunk is then sent to our custom function: rmse(group=<a chunk>). Within the function, we reduce that chunk down to a single number: its RMSE. That's how you get the 24 RMSE numbers back as a result.
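If apply turns out to be slow on larger data, the same numbers can probably be obtained without a custom function: compute the squared error once for every row, take the mean per hour, then take the square root. The sketch below is one way to do that, not the only one; the two-day synthetic data and the names sq_err and hourly_rmse are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data covering two days; the values are made up
dates = pd.Timestamp("2017-01-01") + pd.to_timedelta(np.arange(48), unit="h")
df1 = pd.DataFrame({"date": dates, "model": np.arange(48, dtype=float)})
df2 = pd.DataFrame({"date": dates, "real": np.arange(48, dtype=float) + 2.0})

# Pair the values by timestamp first, exactly as in the apply version
merged = df1.merge(df2, on="date")

# Squared error per row, mean per hour of day, then the square root
sq_err = (merged["model"] - merged["real"]) ** 2
hourly_rmse = np.sqrt(sq_err.groupby(merged["date"].dt.hour).mean())
hourly_rmse.index.name = "hour"
print(hourly_rmse)
```

Because the real values here are always the model values plus 2, every hour should come out with an RMSE of 2, which makes the result easy to check by eye.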