Tags: python, scikit-learn, xgboost, forecasting, training-data

Forecast next day without train and test split


Typically when we have a data frame we split it into train and test. For example, imagine my data frame is something like this:

> df.head()

                  Date   y  wind  temperature
1  2019-10-03 00:00:00  33    12           15
2  2019-10-03 01:00:00  10     5            6
3  2019-10-03 02:00:00  39     6            5
4  2019-10-03 03:00:00  60    13            4
5  2019-10-03 04:00:00  21     3            7

I want to predict y based on the wind and temperature. We then do a split something like this:

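import pandas as pd
from xgboost import XGBRegressor  # assumed model; any regressor with fit/predict would do

# Split chronologically on the Date column (the frame shown above has an integer index)
df['Date'] = pd.to_datetime(df['Date'])
split_date = pd.Timestamp('2019-10-03 02:00:00')  # example cutoff, chosen for illustration

df_train = df.loc[df['Date'] <= split_date].copy()
df_test = df.loc[df['Date'] > split_date].copy()

features = ['wind', 'temperature']
X_train, y_train = df_train[features], df_train['y']
X_test, y_test = df_test[features], df_test['y']

model = XGBRegressor()
model.fit(X_train, y_train)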

We then predict on the test data. However, this uses the wind and temperature features from the test data frame. If I want to predict tomorrow's (unknown) y without knowing tomorrow's hourly temperature and wind, does this method no longer work (for LSTM or XGBoost, for example)?


Solution

  • As you have set up the training, each row is treated as an independent sample, regardless of order, i.e. regardless of which values were observed earlier or later. If you have reason to believe that chronological order matters for predicting y from wind speed and temperature, you will need to change your model or its inputs.

    You could, for example, add columns holding the wind speed and temperature from one hour earlier (shift them by one row). If you believe that y might depend on the weekday, compute the weekday from the date and add it as an input feature.
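
The two suggestions above (lagged weather columns and a weekday column) could be sketched with pandas roughly as follows. This is only an illustration: the helper name add_time_features, the lag parameter, and the wind_lag / temperature_lag column names are made up for this example, and the small frame just repeats the five rows from the question. For a true next-day forecast on hourly data you would probably want a lag of at least 24 rows, so that every input is already observed at prediction time; that is an assumption about the use case, not something the question states.

```python
import pandas as pd

# Toy frame mirroring the question's columns (values copied from df.head()).
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-10-03 00:00:00",
        "2019-10-03 01:00:00",
        "2019-10-03 02:00:00",
        "2019-10-03 03:00:00",
        "2019-10-03 04:00:00",
    ]),
    "y": [33, 10, 39, 60, 21],
    "wind": [12, 5, 6, 13, 3],
    "temperature": [15, 6, 5, 4, 7],
})


def add_time_features(frame, lag=1):
    """Return a copy with lagged weather columns and a weekday column.

    `lag` is the number of rows (hours in this data) to look back.
    The helper and the new column names are illustrative only.
    """
    out = frame.copy()
    out["wind_lag"] = out["wind"].shift(lag)
    out["temperature_lag"] = out["temperature"].shift(lag)
    out["weekday"] = out["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
    # The first `lag` rows have no earlier observation, so drop them.
    return out.dropna(subset=["wind_lag", "temperature_lag"])


features = add_time_features(df, lag=1)

# Inputs now contain only values known before the hour being predicted.
X = features[["wind_lag", "temperature_lag", "weekday"]]
y = features["y"]
```

X and y can then be passed to model.fit in place of the original wind and temperature columns. Because shift leaves the first rows without a lagged value, the training set shrinks by lag rows.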