I have a problem where the mean absolute error (MAE) increases when tuning the hyperparameters of a RandomForestRegressor. I have set scoring to neg_mean_absolute_error, but for some reason the error still increases.
My dataset contains 100,000 observations across 300 variables, and I have used a train/test split with test_size=0.2.
I have tried 200 combinations with RandomizedSearchCV, where I set scoring='neg_mean_absolute_error'. When measuring the MAE on the test data, the default RFR model gives an MAE of about 6500, while the tuned model gives an MAE of about 9000. Shouldn't it decrease or at least stay the same? It seems like tuning is underfitting the model.
The code I've used to tune the model looks like this:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn import metrics

# Hyperparameter search space ('auto' is deprecated in newer scikit-learn;
# for a regressor it means "use all features")
max_features = ['auto', 'sqrt']
min_samples_split = [2, 5, 10, 20, 30, 40]
min_samples_leaf = [5, 10, 20, 50, 60, 80]
max_depth = [int(x) for x in np.linspace(5, 200, num=20)]

random_grid = {'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

# shuffle=True is required when a random_state is passed to KFold
kf = KFold(n_splits=3, shuffle=True, random_state=1)

rfr = RandomForestRegressor(n_estimators=100)
rfr_random = RandomizedSearchCV(estimator=rfr,
                                param_distributions=random_grid,
                                n_iter=200,
                                cv=kf,
                                n_jobs=-1,
                                random_state=53,
                                scoring='neg_mean_absolute_error')
rfr_random.fit(x_train, y_train)

# Refit with the best parameters and evaluate on the held-out test set
RF = RandomForestRegressor(**rfr_random.best_params_)
RF.fit(x_train, y_train)
y_pred = RF.predict(x_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
Can anyone explain why the MAE increases when optimizing the initial model?
This can happen.
You are training on the training set; fitting it well there does not mean the model fits the test set well. Have you measured the error on the training set?
y_pred_train=RF.predict(x_train)
print('Mean Absolute Error (Train):', metrics.mean_absolute_error(y_train, y_pred_train))
If this error is much smaller than the test error, you have overfit: the model gives a near-'perfect' prediction on your training data but does not generalize to your test data.
In your case you could evaluate with k-fold cross-validation. This tries several train/test splits and averages the error, which gives a more reliable estimate of how the model will perform than a single split (see the sketch below).
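A minimal sketch of how that could look with scikit-learn's cross_val_score, reusing x_train, y_train and rfr_random from your code; the choice of 5 folds is just an example:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Cross-validated MAE for the tuned parameters (example: 5 folds)
model = RandomForestRegressor(n_estimators=100, **rfr_random.best_params_)
scores = cross_val_score(model, x_train, y_train,
                         cv=5, scoring='neg_mean_absolute_error')
print('CV MAE per fold:', -scores)
print('Mean CV MAE:', -scores.mean())

If the mean cross-validated MAE is close to the test MAE, the gap you see is not just an unlucky single split.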
It is also a good idea to divide your dataset into train, dev and test sets (e.g. dev and test together making up 0.2 of the data). You then train, evaluate on the dev set, tune, evaluate on the dev set again, and only once you have a good result do you run the model on the test set; that final score tells you whether it was really good. A sketch of such a split is below.
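One way to make that split with scikit-learn's train_test_split, assuming your full data is in x and y (the variable names and the 80/10/10 ratio are just an example):

from sklearn.model_selection import train_test_split

# First hold out 20% of the data, then split that hold-out half-and-half
# into a dev set (10%) and a test set (10%).
x_train, x_rest, y_train, y_rest = train_test_split(x, y, test_size=0.2, random_state=1)
x_dev, x_test, y_dev, y_test = train_test_split(x_rest, y_rest, test_size=0.5, random_state=1)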