python machine-learning forecasting train-test-split

Why results are inaccurate when I am using different dataset for testing a model in Machine Learning?

I am trying to do forecasting based on time series. I am doing temperature forecasting by using the past three years of hourly data. Instead of using X_test from train_test_split method, I am using my own test dataset because I need seven-day ahead forecasting.

Problem: When I am using dummy Test data set for forecasting it’s giving incorrect values. But when I using Test data set from train_test_split method, then it’s giving accurate values. I don’t understand why this is happening.

What I tried to fix this problem: First, I thought this is happening because I am not applying feature scaling but after implementing feature scaling the results are same. Then I thought, when train_test_split split the data it also gives some randomness to data so I applied randomness on my dummy Test data but still, results are the same.

My question: How can I apply different dataframe for testing a model? And how did I get accurate results?

Program:

df = pd.read_csv("Timeseries_47.999_7.850_SA_0deg_0deg_2013_2016.csv", sep=",")

time_mod = []
for i in range(0,len(df['time'])):
    ss=pd.to_datetime(df['time'][i], format= "%Y%m%d:%H%M")
    time_mod.append(ss)
df['datetime'] = time_mod

df["Hour"] = pd.to_datetime(df["datetime"]).dt.hour
df["Month"] = pd.to_datetime(df["datetime"]).dt.month
df["Day_of_year"] = pd.to_datetime(df["datetime"]).dt.dayofyear
df["Day_of_month"] = pd.to_datetime(df["datetime"]).dt.day
df["week_of_year"] = pd.to_datetime(df["datetime"]).dt.week

X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
y = df[{"T2m"}].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0) 

## Creating dummy datetime for Test data
df.set_index('datetime',inplace=True)
future_dates = [df.index[-1]+DateOffset(hours=x) for x in range(0,168)]
future_dates_df = pd.DataFrame({'Data':future_dates})

future_dates_df["Hour"] = pd.to_datetime(future_dates_df["Data"]).dt.hour
future_dates_df["Month"] = pd.to_datetime(future_dates_df["Data"]).dt.month
future_dates_df["Day_of_year"] = future_dates_df["Data"].dt.dayofyear
future_dates_df["Day_of_month"] = pd.to_datetime(future_dates_df["Data"]).dt.day
future_dates_df["Date"] = pd.to_datetime(future_dates_df["Data"]).dt.date
future_dates_df["week_of_year"] = future_dates_df["Data"].dt.week

X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values

#Model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test_dum)

plt.plot(y_test, color="r", label="actual")
plt.plot(y_pred, label="forecasted")
sns.set(rc={'figure.figsize':(20,10)})
plt.legend()
plt.show()

Solution

The reason behind getting inaccurate could be:

The dummies dataset variables are not arranged in the same way as actual dataset.

X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values

I also notice that you are applying Linear Regression but data does not look like linear. Try Polynomial Regression, Decision Tree, Random Forest or the model which is good with non-linear data.
I think eliminating some non-essential independent variables can also improve your results. Only consider: Hour and Day_of_year
Last, try to create dummies dataset directly in csv file and then separate train and test dataset in python.