I am trying to do time series forecasting. I am forecasting temperature using the past three years of hourly data.
Instead of using X_test from the train_test_split method, I am using my own test dataset because I need a seven-day-ahead forecast.
Problem: When I use my dummy test dataset for forecasting, it gives incorrect values. But when I use the test dataset from the train_test_split method, it gives accurate values. I don't understand why this is happening.
What I tried to fix this problem: First, I thought this was happening because I was not applying feature scaling, but after implementing feature scaling the results were the same. Then I thought that, because train_test_split also shuffles the data when it splits it, I should shuffle my dummy test data as well, but the results were still the same.
My question: How can I use a different dataframe to test a model? And how can I get accurate results?
Program:
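import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.tseries.offsets import DateOffset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("Timeseries_47.999_7.850_SA_0deg_0deg_2013_2016.csv", sep=",")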
time_mod = []
for i in range(0,len(df['time'])):
    ss=pd.to_datetime(df['time'][i], format= "%Y%m%d:%H%M")
    time_mod.append(ss)
df['datetime'] = time_mod
df["Hour"] = pd.to_datetime(df["datetime"]).dt.hour
df["Month"] = pd.to_datetime(df["datetime"]).dt.month
df["Day_of_year"] = pd.to_datetime(df["datetime"]).dt.dayofyear
df["Day_of_month"] = pd.to_datetime(df["datetime"]).dt.day
df["week_of_year"] = pd.to_datetime(df["datetime"]).dt.week
X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
y = df[{"T2m"}].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
## Creating dummy datetime for Test data
df.set_index('datetime',inplace=True)
future_dates = [df.index[-1]+DateOffset(hours=x) for x in range(0,168)]
future_dates_df = pd.DataFrame({'Data':future_dates})
future_dates_df["Hour"] = pd.to_datetime(future_dates_df["Data"]).dt.hour
future_dates_df["Month"] = pd.to_datetime(future_dates_df["Data"]).dt.month
future_dates_df["Day_of_year"] = future_dates_df["Data"].dt.dayofyear
future_dates_df["Day_of_month"] = pd.to_datetime(future_dates_df["Data"]).dt.day
future_dates_df["Date"] = pd.to_datetime(future_dates_df["Data"]).dt.date
future_dates_df["week_of_year"] = future_dates_df["Data"].dt.week
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values
#Model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test_dum)
plt.plot(y_test, color="r", label="actual")
plt.plot(y_pred, label="forecasted")
sns.set(rc={'figure.figsize':(20,10)})
plt.legend()
plt.show()
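One reason for the inaccurate results could be that the feature columns are selected differently for training and for prediction:
X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values
X is built from a set, which has no guaranteed order, while X_test_dum is built from a list, so the columns the model was trained on may not line up with the columns it is asked to predict on. Use the same ordered list of column names in both places.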
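A minimal sketch of keeping the column order identical for the training data and the future dates. It assumes a small synthetic frame in place of the real CSV, and the names FEATURES and add_time_features are hypothetical helpers introduced only for illustration. It also uses .dt.isocalendar().week, since the .dt.week accessor is deprecated in recent pandas versions.

```python
import numpy as np
import pandas as pd

# One ordered list of feature names, reused for training and forecasting
# (hypothetical name, not part of the original program).
FEATURES = ["Hour", "Day_of_year", "Day_of_month", "week_of_year", "Month"]


def add_time_features(frame, column):
    """Return a copy of `frame` with calendar features derived from `column`."""
    out = frame.copy()
    stamps = pd.to_datetime(out[column])
    out["Hour"] = stamps.dt.hour
    out["Month"] = stamps.dt.month
    out["Day_of_year"] = stamps.dt.dayofyear
    out["Day_of_month"] = stamps.dt.day
    # isocalendar().week stands in for the deprecated .dt.week accessor.
    out["week_of_year"] = stamps.dt.isocalendar().week.astype(int)
    return out


# Synthetic stand-in for the CSV: 48 hourly timestamps with a made-up temperature.
one_hour = pd.Timedelta(hours=1)
history = pd.DataFrame(
    {"datetime": pd.date_range("2016-01-01", periods=48, freq=one_hour)}
)
history["T2m"] = np.linspace(0.0, 5.0, len(history))
history = add_time_features(history, "datetime")

# The next 168 hours, starting one hour after the last observation.
last = history["datetime"].iloc[-1]
future = pd.DataFrame(
    {"Data": pd.date_range(last + one_hour, periods=168, freq=one_hour)}
)
future = add_time_features(future, "Data")

# Both matrices are built from the same list, so column i means the same thing in each.
X = history[FEATURES].to_numpy()
X_future = future[FEATURES].to_numpy()
print(X.shape, X_future.shape)
```

Because both matrices are indexed with the same list, a model fitted on X sees the features in the same positions when it predicts on X_future.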
The model is Linear Regression, but the data does not look linear. Try Polynomial Regression, Decision Tree, Random Forest, or another model that works well with non-linear data.
... csv file and then separate the train and test datasets in Python.
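The suggestion to try a non-linear model can be sketched roughly as follows. This is only an illustration on synthetic hourly data with a daily cycle, not the original dataset. Holding out the last 168 hours without shuffling is an assumption made here to mirror the seven-day horizon in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the real data: 60 days of hourly temperatures
# following a daily sine cycle plus a little noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
hour_of_day = hours % 24
temperature = (
    10 + 5 * np.sin(2 * np.pi * hour_of_day / 24) + rng.normal(0, 0.5, hours.size)
)

X = hour_of_day.reshape(-1, 1)
y = temperature

# Chronological hold-out: the last 7 days (168 hours) are the test set, no shuffling.
X_train, X_test = X[:-168], X[-168:]
y_train, y_test = y[:-168], y[-168:]

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

mae_linear = mean_absolute_error(y_test, linear.predict(X_test))
mae_forest = mean_absolute_error(y_test, forest.predict(X_test))
print(f"linear MAE: {mae_linear:.2f}, forest MAE: {mae_forest:.2f}")
```

On this toy signal the straight line cannot follow the daily cycle, while the forest can learn a separate level for each hour. Whether the same holds for the real temperature data would need to be checked on that data.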