I am wrapping the Scikit-Learn random forest model in a custom estimator class, as follows:
from sklearn.base import BaseEstimator, RegressorMixin

class Model(BaseEstimator, RegressorMixin):
    def __init__(self, model):
        self.model = model

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def score(self, X, y):
        from sklearn.metrics import mean_squared_error
        return mean_squared_error(y_true=y,
                                  y_pred=self.model.predict(X),
                                  squared=False)

    def predict(self, X):
        return self.model.predict(X)
class RandomForest(Model):
    def __init__(self, n_estimators=100,
                 max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        from sklearn.ensemble import RandomForestRegressor
        self.model = RandomForestRegressor(n_estimators=self.n_estimators,
                                           max_depth=self.max_depth,
                                           min_samples_split=self.min_samples_split,
                                           min_samples_leaf=self.min_samples_leaf,
                                           max_features=self.max_features,
                                           random_state=777)

    def get_params(self, deep=True):
        return {"n_estimators": self.n_estimators,
                "max_depth": self.max_depth,
                "min_samples_split": self.min_samples_split,
                "min_samples_leaf": self.min_samples_leaf,
                "max_features": self.max_features}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
I mainly followed the official Scikit-Learn developer guide, which can be found at https://scikit-learn.org/stable/developers/develop.html
Here is what my grid search looks like:
import pandas as pd
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=RandomForest(),
                           param_grid={'max_depth': [1, 3, 6],
                                       'n_estimators': [10, 100, 300]},
                           n_jobs=-1,
                           scoring='neg_root_mean_squared_error',
                           cv=5, verbose=True).fit(X, y)
print(pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score'))
The grid search log and grid_search.cv_results_ are printed below:
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
mean_fit_time std_fit_time mean_score_time std_score_time \
0 0.210918 0.002450 0.016754 0.000223
1 0.207049 0.001675 0.016579 0.000147
2 0.206495 0.002001 0.016598 0.000158
3 0.206799 0.002417 0.016740 0.000144
4 0.207534 0.001603 0.016668 0.000269
5 0.206384 0.001396 0.016605 0.000136
6 0.220052 0.024280 0.017247 0.001137
7 0.226838 0.027507 0.017351 0.000979
8 0.205738 0.003420 0.016246 0.000626
param_max_depth param_n_estimators params \
0 1 10 {'max_depth': 1, 'n_estimators': 10}
1 1 100 {'max_depth': 1, 'n_estimators': 100}
2 1 300 {'max_depth': 1, 'n_estimators': 300}
3 3 10 {'max_depth': 3, 'n_estimators': 10}
4 3 100 {'max_depth': 3, 'n_estimators': 100}
5 3 300 {'max_depth': 3, 'n_estimators': 300}
6 6 10 {'max_depth': 6, 'n_estimators': 10}
7 6 100 {'max_depth': 6, 'n_estimators': 100}
8 6 300 {'max_depth': 6, 'n_estimators': 300}
split0_test_score split1_test_score split2_test_score split3_test_score \
0 -5.246725 -3.200585 -3.326962 -3.209387
1 -5.246725 -3.200585 -3.326962 -3.209387
2 -5.246725 -3.200585 -3.326962 -3.209387
3 -5.246725 -3.200585 -3.326962 -3.209387
4 -5.246725 -3.200585 -3.326962 -3.209387
5 -5.246725 -3.200585 -3.326962 -3.209387
6 -5.246725 -3.200585 -3.326962 -3.209387
7 -5.246725 -3.200585 -3.326962 -3.209387
8 -5.246725 -3.200585 -3.326962 -3.209387
split4_test_score mean_test_score std_test_score rank_test_score
0 -2.911422 -3.579016 0.845021 1
1 -2.911422 -3.579016 0.845021 1
2 -2.911422 -3.579016 0.845021 1
3 -2.911422 -3.579016 0.845021 1
4 -2.911422 -3.579016 0.845021 1
5 -2.911422 -3.579016 0.845021 1
6 -2.911422 -3.579016 0.845021 1
7 -2.911422 -3.579016 0.845021 1
8 -2.911422 -3.579016 0.845021 1
[Parallel(n_jobs=-1)]: Done 45 out of 45 | elapsed: 3.2s finished
My question is: why does the grid search return exactly the same scores for every parameter combination, on every data split?
My assumption is that the grid search is effectively evaluating a single parameter combination (e.g. {'max_depth': 1, 'n_estimators': 10}) on all the data splits. If this is the case, why does it happen?
Finally, how can I make the grid search return the correct results for all parameter combinations?
Your set_params method doesn't actually change the hyperparameters of the RandomForestRegressor instance stored in the self.model attribute. Instead, it sets the attributes on your RandomForest instance directly, and those attributes have no effect on the inner regressor, which was already constructed in __init__. So the grid search repeatedly sets parameters that don't matter, and the actual model being fit is identical every time. (Similarly, the get_params method reads the RandomForest attributes, which are not necessarily the same as the RandomForestRegressor attributes.)
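You can see the mismatch directly. A quick check, assuming the class definitions above:

rf = RandomForest()
rf.set_params(max_depth=3)
print(rf.max_depth)                        # 3 -- the wrapper attribute changed
print(rf.model.get_params()['max_depth'])  # None -- the inner regressor did not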
You should be able to fix most of this by having set_params just call self.model.set_params, and by having get_params read the values from self.model instead of from self.
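Here is a minimal sketch of that fix, keeping your wrapper structure and treating the inner RandomForestRegressor as the single source of truth for these hyperparameters:

class RandomForest(Model):
    def __init__(self, n_estimators=100,
                 max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        from sklearn.ensemble import RandomForestRegressor
        self.model = RandomForestRegressor(n_estimators=n_estimators,
                                           max_depth=max_depth,
                                           min_samples_split=min_samples_split,
                                           min_samples_leaf=min_samples_leaf,
                                           max_features=max_features,
                                           random_state=777)

    def get_params(self, deep=True):
        # Read from the inner regressor, so the values always match what fit will use.
        inner = self.model.get_params()
        return {name: inner[name]
                for name in ("n_estimators", "max_depth", "min_samples_split",
                             "min_samples_leaf", "max_features")}

    def set_params(self, **parameters):
        # Forward to the inner regressor, so the change actually takes effect.
        self.model.set_params(**parameters)
        return self

With this version, each clone that GridSearchCV configures via set_params really reconfigures the regressor that fit trains, so the nine candidates produce genuinely different scores.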
There is another structural problem: you instantiate the model attribute in __init__, so the inner regressor is frozen with whatever parameter values were supplied at construction time. The developer guide you linked recommends that __init__ do nothing beyond storing its keyword arguments as attributes, and that the actual work happen in fit; following that convention avoids this class of bug entirely.
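For completeness, here is a sketch of the pattern the developer guide describes: store the parameters untouched in __init__ and build the regressor inside fit. With this layout, the default get_params/set_params inherited from BaseEstimator work as-is, with no overrides needed:

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor

class RandomForest(BaseEstimator, RegressorMixin):
    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        # Only store the parameters; no model is built here.
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features

    def fit(self, X, y):
        # Build the inner regressor from the current parameter values, so any
        # set_params call made before fit is always honored.
        self.model_ = RandomForestRegressor(n_estimators=self.n_estimators,
                                            max_depth=self.max_depth,
                                            min_samples_split=self.min_samples_split,
                                            min_samples_leaf=self.min_samples_leaf,
                                            max_features=self.max_features,
                                            random_state=777)
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)

(The trailing underscore in model_ is the scikit-learn convention for attributes that only exist after fit has been called.)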