python, random-forest, evaluation, grid-search

Why is the MAE increasing when tuning a RandomForestRegressor (RFR) model?


I have a problem where the mean absolute error (MAE) increases when tuning the parameters of a RandomForestRegressor. I have set the scoring to neg_mean_absolute_error, but for some reason the error still increases.

My dataset contains 100,000 observations across 300 variables, and I have used a train/test split with test_size=0.2.
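For reference, the split was created roughly like this (a minimal sketch; X and y stand in for the actual feature matrix and target, and the random_state is arbitrary):

from sklearn.model_selection import train_test_split

# X: feature matrix (100,000 x 300), y: target vector -- placeholder names
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)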

I have tried 200 combinations with RandomizedSearchCV, where I set scoring='neg_mean_absolute_error'. When measuring the MAE on the test data, I get MAE = 6500 with the default RFR model and MAE = 9000 with the tuned model. Shouldn't it decrease, or at least stay the same? It seems like tuning is underfitting the model.

The code I've used to tune the model looks like this:

import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# hyperparameter values to sample from
# (note: max_features='auto' is deprecated in recent scikit-learn versions)
max_features = ['auto', 'sqrt']
min_samples_split = [2, 5, 10, 20, 30, 40]
min_samples_leaf = [5, 10, 20, 50, 60, 80]
max_depth = [int(x) for x in np.linspace(5, 200, num=20)]

random_grid = {'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf
               }

# random_state only takes effect on KFold when shuffle=True
kf = KFold(n_splits=3, shuffle=True, random_state=1)

rfr = RandomForestRegressor(n_estimators=100)
rfr_random = RandomizedSearchCV(estimator=rfr,
                                param_distributions=random_grid,
                                n_iter=200,
                                cv=kf,
                                n_jobs=-1,
                                random_state=53,
                                scoring='neg_mean_absolute_error')

rfr_random.fit(x_train, y_train)

# refit a fresh forest with the best parameters from the search
# (n_estimators keeps its default of 100, matching the search)
RF = RandomForestRegressor(**rfr_random.best_params_)
RF.fit(x_train, y_train)

y_pred = RF.predict(x_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

Can anyone explain why the MAE increases when optimizing the initial model?


Solution

  • This can happen.

    You are training on the train set, but that does not mean the model fits the test set well. Did you also predict on the training set?

    y_pred_train=RF.predict(x_train)
    print('Mean Absolute Error (Train):', metrics.mean_absolute_error(y_train, y_pred_train)) 
    

    If this error is quite small, you have overfit! That means you have a near-'perfect' prediction for your train data, but it does not work for your test data.

    In your case you could try k-fold cross-validation. This basically tries several train/test splits instead of relying on a single one, which gives a more reliable estimate of the error.
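    As a rough sketch, reusing x_train/y_train and the tuned RF from the question (with this scoring, cross_val_score returns negative MAE, so flip the sign):

    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validated MAE for the tuned forest
    scores = cross_val_score(RF, x_train, y_train, cv=5,
                             scoring='neg_mean_absolute_error')
    print('Cross-validated MAE:', -scores.mean())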

    It is also good to divide your dataset into train, dev and test sets (with dev and test together making up e.g. 0.2 of the data). Then you train, try the model on the dev set, tune the training again, try it on the dev set once more, and only after you have a good result do you roll it out on the test set; then you see if it was really good! A sketch of such a split is shown below.
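    A minimal sketch of that three-way split, calling train_test_split twice (X and y are placeholders, and the 80/10/10 proportions are just an example):

    from sklearn.model_selection import train_test_split

    # hold out 20% first, then split that hold-out in half into dev and test
    x_train, x_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=1)
    x_dev, x_test, y_dev, y_test = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=1)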