I have developed some different datasets and I want to write a for loop to do the training for each of which and at the end, I want to have RMSE
for each dataset. I tried by passing through a for
loop but it does not work since it gives back the same value for each dataset while I know that it should be different. The code that I have written is below:
for i in NEW_middle_index:
DF = df1.iloc[i-100:i+100,:]
# Append an empty sublist inside the list
FINAL_DF.append(DF)
y = DF.iloc[:,3]
X = DF.drop(columns='Target')
index_train = int(0.7 * len(X))
X_train = X[:index_train]
y_train = y[:index_train]
X_test = X[index_train:]
y_test = y[index_train:]
scaler_x = MinMaxScaler().fit(X_train)
X_train = scaler_x.transform(X_train)
X_test = scaler_x.transform(X_test)
xgb_r = xg.XGBRegressor(objective ='reg:linear',
n_estimators = 20, seed = 123)
for i in range(len(NEW_middle_index)):
# print(i)
# Fitting the model
xgb_r.fit(X_train,y_train)
# Predict the model
pred = xgb_r.predict(X_test)
# RMSE Computation
rmse = np.sqrt(mean_squared_error(y_test,pred))
# print(rmse)
RMSE.append(rmse)
Not sure if you indented it correctly. You are overwriting X_train
and X_test
and when you fit your model, its always on the same dataset, hence you get the same results.
One option is to fit the model once you create the train / test dataframes. Else if you want to keep the train / test set, maybe something like below, to store them in a list of dictionaries, without changing too much of your code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import xgboost as xg
df1 = pd.DataFrame(np.random.normal(0,1,(600,3)))
df1['Target'] = np.random.uniform(0,1,600)
NEW_middle_index = [100,300,500]
NEWDF = []
for i in NEW_middle_index:
y = df1.iloc[i-100:i+100:,3]
X = df1.iloc[i-100:i+100,:].drop(columns='Target')
index_train = int(0.7 * len(X))
scaler_x = MinMaxScaler().fit(X)
X_train = scaler_x.transform(X[:index_train])
y_train = y[:index_train]
X_test = scaler_x.transform(X[index_train:])
y_test = y[index_train:]
NEWDF.append({'X_train':X_train,'y_train':y_train,'X_test':X_test,'y_test':y_test})
Then we fit and calculate RMSE:
RMSE = []
xgb_r = xg.XGBRegressor(objective ='reg:linear',n_estimators = 20, seed = 123)
for i in range(len(NEW_middle_index)):
xgb_r.fit(NEWDF[i]['X_train'],NEWDF[i]['y_train'])
pred = xgb_r.predict(NEWDF[i]['X_test'])
rmse = np.sqrt(mean_squared_error(NEWDF[i]['y_test'],pred))
RMSE.append(rmse)
RMSE
[0.3524827559800294, 0.3098101362502435, 0.3843173269966071]