python, pandas, dataframe, scikit-learn, time-series

Sktime TimeSeriesSplit with pandas dataframe


I am trying to cross-validate a time series held in a pandas dataframe, using TimeSeriesSplit (imported from scikit-learn in the code below) with an sktime forecaster. The dataframe df has a daily format:

   timepoint  balance
0  2017-03-01    1.0
1  2017-04-01    0.0
2  2017-05-01    2.0
3  2017-06-01    3.0
4  2017-07-01    0.0
...

I am trying to use Prophet and run the following code:

#Packages
from sktime.forecasting.fbprophet import Prophet
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import numpy as np

#preparation
tscv = TimeSeriesSplit()
rmse = []
model_ph = Prophet()

#function
for train_index, test_index in tscv.split((df)):
    cv_train, cv_test = df.iloc[train_index], df.iloc[test_index]
    ph= model_ph.fit(cv_train)
    predictions = model_ph.predict(cv_test.index.values[0], cv_test.index.values[-1])
    true_values = cv_test.values
    rmse.append(np.sqrt(mean_squared_error(true_values, predictions)))

#print
print("RMSE: {}".format(np.mean(rmse)))

which leads to the following error:

TypeError: X must be either None, or in an sktime compatible format, of scitype 
Series, Panel or Hierarchical, for instance a pandas.DataFrame with sktime 
compatible time indices...

I would have expected the loop to produce a mean_squared_error value for each split.


Solution

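  • The problem occurs because sktime's Prophet only accepts specific input formats. predict takes the forecasting horizon fh as its first argument and exogenous data X as its second, so the call in the question passed an index value as X, which triggered the TypeError. In my case the solution was to pass a pd.date_range as the forecasting horizon for the prediction: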

    import pandas as pd

    for train_index, test_index in tscv.split(df):
        cv_train, cv_test = df.iloc[train_index], df.iloc[test_index]
        model_ph.fit(cv_train)
        # Forecasting horizon: explicit dates covering the test fold
        fh = pd.date_range(cv_test['timepoint'].values[0], periods=len(cv_test), freq='D')
        forecast = model_ph.predict(fh=fh)
        predictions = forecast['balance'].values
        true_values = cv_test['balance'].values
        # np.sqrt, because a bare sqrt was never imported
        rmse.append(np.sqrt(mean_squared_error(true_values, predictions)))
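
To check the fold and horizon logic without installing sktime or Prophet, the loop can be sketched with a stand-in forecaster. This is only an illustration under stated assumptions: the data is synthetic (not the dataframe from the question), `n_splits=3` is chosen arbitrarily, and a naive "repeat the last training value" forecast replaces `model_ph.fit` / `model_ph.predict`. The part that carries over is how `pd.date_range` builds one forecast date per row of each test fold.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic daily data, made up for illustration only
df = pd.DataFrame({
    "timepoint": pd.date_range("2017-03-01", periods=12, freq="D"),
    "balance": [1.0, 0.0, 2.0, 3.0, 0.0, 1.0, 2.0, 1.0, 0.0, 3.0, 2.0, 1.0],
})

tscv = TimeSeriesSplit(n_splits=3)  # assumption: 3 expanding-window folds
rmse = []
horizons = []

for train_index, test_index in tscv.split(df):
    cv_train, cv_test = df.iloc[train_index], df.iloc[test_index]

    # Explicit forecast dates for this fold, built as in the answer above
    fh = pd.date_range(cv_test["timepoint"].iloc[0], periods=len(cv_test), freq="D")
    horizons.append(fh)

    # Stand-in for the Prophet fit/predict step (assumption):
    # repeat the last observed training value for every date in the horizon
    predictions = np.full(len(fh), cv_train["balance"].iloc[-1])
    true_values = cv_test["balance"].to_numpy()

    rmse.append(np.sqrt(mean_squared_error(true_values, predictions)))

print("RMSE per fold:", rmse)
print("Mean RMSE: {}".format(np.mean(rmse)))
```

Each horizon should line up exactly with the `timepoint` values of its test fold. If it does not (for example because the data is not actually daily), the `freq` argument is the first thing to check.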