Suppose I have a pandas DataFrame, vols, where vols.head() returns:
              Return       Vol
DataDate
2019-12-26  0.002291  0.002400
2019-12-27  0.002292  0.002392
2019-12-30  0.002288  0.002385
2019-12-31  0.002288  0.002378
2020-01-01  0.002286  0.002378
Next, I rename the columns of vols:
vols.columns = ['Realized', 'Predicted']
vols.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 922 entries, 2019-12-26 to 2023-07-27
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Realized 922 non-null float64
1 Predicted 922 non-null float64
dtypes: float64(2)
I want to calculate the rolling root mean square error (RMSE):
vols_rolling = vols.rolling(window=52)
from sklearn.metrics import mean_squared_error as mse
vols_rolling.apply(lambda x: mse(x['Realized'], x['Predicted']))
I am getting the following ValueError:
ValueError Traceback (most recent call last)
File ~\anaconda3\Lib\site-packages\pandas\_libs\tslibs\parsing.pyx:440, in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso()
File ~\anaconda3\Lib\site-packages\pandas\_libs\tslibs\parsing.pyx:649, in pandas._libs.tslibs.parsing.dateutil_parse()
ValueError: Unknown datetime string format, unable to parse: Realized
During handling of the above exception, another exception occurred:
The traceback is quite long, so I have not pasted all of it here.
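The issue is that rolling.apply works per column: the function receives one column's window at a time as a Series, but your function needs to access two columns simultaneously. Inside the lambda, x['Realized'] is therefore treated as a label lookup on the DatetimeIndex, which is why pandas tries (and fails) to parse 'Realized' as a date.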
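A small probe can make the per-column behaviour visible. This is only a sketch on made-up data: the `toy` frame, its values, and the `probe` helper are hypothetical stand-ins, not part of the original question.

```python
import pandas as pd

# Hypothetical toy data standing in for `vols` (values are made up).
idx = pd.date_range("2019-12-26", periods=5, freq="D")
toy = pd.DataFrame(
    {"Realized": [1.0, 2.0, 3.0, 4.0, 5.0],
     "Predicted": [1.5, 2.5, 2.5, 4.5, 4.0]},
    index=idx,
)

seen = []  # records the type and dimensionality of each object passed in


def probe(window):
    # With the default raw=False, `window` is a one-dimensional Series
    # holding a single column's values for the current window.
    seen.append((type(window).__name__, window.ndim))
    return window.mean()


result = toy.rolling(window=2).apply(probe)
print(set(seen))
```

Every call receives a one-dimensional Series rather than a two-column DataFrame, so there is no 'Realized' or 'Predicted' column to select inside the function.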
You can cheat and use one Series to retrieve the index and slice the external DataFrame:
from sklearn.metrics import mean_squared_error as mse
vols_rolling = vols.rolling(window=52, min_periods=1)
vols_rolling['Realized'].apply(lambda x: mse(vols.loc[x.index, 'Realized'], vols.loc[x.index, 'Predicted']))
Output:
DataDate
2019-12-26 1.188100e-08
2019-12-27 1.094050e-08
2019-12-30 1.043000e-08
2019-12-31 9.847500e-09
2020-01-01 9.570800e-09
Name: Realized, dtype: float64
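Note that mean_squared_error returns the mean squared error, so the values above still need a square root to become an RMSE. As an alternative that avoids apply entirely, the squared errors can be computed once and then averaged over the rolling window. The following is a minimal sketch on made-up data; the `toy` frame and the window size of 3 are assumptions for illustration, and on the real data you would use `vols` with `window=52`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `vols` (values are made up).
idx = pd.date_range("2019-12-26", periods=5, freq="D")
toy = pd.DataFrame(
    {"Realized": [1.0, 2.0, 3.0, 4.0, 5.0],
     "Predicted": [1.5, 2.5, 2.5, 4.5, 4.0]},
    index=idx,
)

# Element-wise squared error between the two columns.
sq_err = (toy["Realized"] - toy["Predicted"]) ** 2

# Rolling mean of the squared error is the rolling MSE;
# its square root is the rolling RMSE.
rolling_mse = sq_err.rolling(window=3, min_periods=1).mean()
rolling_rmse = np.sqrt(rolling_mse)
print(rolling_rmse)
```

Because this uses only vectorized operations, it should scale better than calling a Python function for every window, and it does not need scikit-learn.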