
Training loop for XGBoost on different datasets


I have created several different datasets and I want to write a for loop that trains a model on each of them and, at the end, gives me the RMSE for each dataset. I tried doing this with a for loop, but it does not work: it returns the same RMSE value for every dataset, while I know the values should be different. The code I have written is below:

for i in NEW_middle_index:
    DF = df1.iloc[i-100:i+100, :]
    # Keep this dataset's slice for later
    FINAL_DF.append(DF)

    y = DF.iloc[:, 3]
    X = DF.drop(columns='Target')

    # 70/30 chronological train/test split
    index_train = int(0.7 * len(X))

    X_train = X[:index_train]
    y_train = y[:index_train]

    X_test = X[index_train:]
    y_test = y[index_train:]

    # Scale features using statistics from the training portion only
    scaler_x = MinMaxScaler().fit(X_train)
    X_train = scaler_x.transform(X_train)
    X_test  = scaler_x.transform(X_test)

xgb_r = xg.XGBRegressor(objective='reg:linear',
                        n_estimators=20, seed=123)
for i in range(len(NEW_middle_index)):

    # Fit the model
    xgb_r.fit(X_train, y_train)

    # Predict on the test set
    pred = xgb_r.predict(X_test)

    # RMSE computation
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    RMSE.append(rmse)


Solution

  • Not sure if your indentation is correct, but the problem is that you are overwriting X_train and X_test on every pass of the first loop. Once that loop finishes, they hold only the last dataset's split, so the second loop fits the model on the same data every time, which is why you get identical results.

    One option is to fit the model right after you create the train / test split, inside the same loop (see the sketch at the end of this answer). Otherwise, if you want to keep every train / test set, you can store them in a list of dictionaries, without changing too much of your code:

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import mean_squared_error
    import xgboost as xg

    # Toy data standing in for df1: three feature columns plus a Target column
    df1 = pd.DataFrame(np.random.normal(0, 1, (600, 3)))
    df1['Target'] = np.random.uniform(0, 1, 600)

    NEW_middle_index = [100, 300, 500]
    NEWDF = []
    for i in NEW_middle_index:

        y = df1.iloc[i-100:i+100, 3]
        X = df1.iloc[i-100:i+100, :].drop(columns='Target')

        index_train = int(0.7 * len(X))
        # Note: fitting the scaler on all of X lets test statistics leak in;
        # fit on X[:index_train] instead for a stricter setup
        scaler_x = MinMaxScaler().fit(X)

        X_train = scaler_x.transform(X[:index_train])
        y_train = y[:index_train]

        X_test = scaler_x.transform(X[index_train:])
        y_test = y[index_train:]

        # One dict per dataset, so nothing gets overwritten
        NEWDF.append({'X_train': X_train, 'y_train': y_train,
                      'X_test': X_test, 'y_test': y_test})
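
    Storing each split in its own dictionary means nothing is shared between iterations; the second loop simply looks each dataset up by index.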
    

    Then we fit the model and compute the RMSE for each dataset:

    RMSE = []
    # 'reg:linear' is a deprecated alias of 'reg:squarederror' in recent XGBoost
    xgb_r = xg.XGBRegressor(objective='reg:linear', n_estimators=20, seed=123)

    for i in range(len(NEW_middle_index)):

        # fit() retrains the estimator from scratch on each dataset
        xgb_r.fit(NEWDF[i]['X_train'], NEWDF[i]['y_train'])
        pred = xgb_r.predict(NEWDF[i]['X_test'])
        rmse = np.sqrt(mean_squared_error(NEWDF[i]['y_test'], pred))
        RMSE.append(rmse)

    RMSE
    [0.3524827559800294, 0.3098101362502435, 0.3843173269966071]
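
    Since the toy data is generated without a fixed random seed, the exact RMSE values above will vary from run to run; what matters is that the three datasets now give three different scores.

    For completeness, here is a minimal sketch of the first option mentioned above: fit and score inside the same loop that builds each split, so the intermediate arrays never need to be kept around. It reuses the imports, toy df1, and NEW_middle_index from the snippets above, and fits the scaler on the training portion only:

    RMSE = []
    xgb_r = xg.XGBRegressor(objective='reg:linear', n_estimators=20, seed=123)

    for i in NEW_middle_index:
        y = df1.iloc[i-100:i+100, 3]
        X = df1.iloc[i-100:i+100, :].drop(columns='Target')

        index_train = int(0.7 * len(X))

        # Scale using training statistics only
        scaler_x = MinMaxScaler().fit(X.iloc[:index_train])
        X_train = scaler_x.transform(X.iloc[:index_train])
        X_test = scaler_x.transform(X.iloc[index_train:])

        # Fit and score immediately, before the next iteration replaces anything
        xgb_r.fit(X_train, y.iloc[:index_train])
        pred = xgb_r.predict(X_test)
        RMSE.append(np.sqrt(mean_squared_error(y.iloc[index_train:], pred)))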