Tags: python, machine-learning, scikit-learn, grid-search, gridsearchcv

Grid Search Returns Exactly the Same Result Given a Custom Model


I am wrapping the Scikit-Learn random forest model in a custom estimator class, as follows:

from sklearn.base import BaseEstimator, RegressorMixin

class Model(BaseEstimator, RegressorMixin):
    def __init__(self, model):
        self.model = model
    
    def fit(self, X, y):
        self.model.fit(X, y)
        
        return self
    
    def score(self, X, y):
        from sklearn.metrics import mean_squared_error

        return mean_squared_error(y_true=y,
                                  y_pred=self.model.predict(X),
                                  squared=False)
    
    def predict(self, X):
        return self.model.predict(X)

class RandomForest(Model):
    def __init__(self, n_estimators=100, 
                 max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features

        from sklearn.ensemble import RandomForestRegressor

        self.model = RandomForestRegressor(n_estimators=self.n_estimators,
                                           max_depth=self.max_depth,
                                           min_samples_split=self.min_samples_split,
                                           min_samples_leaf=self.min_samples_leaf,
                                           max_features=self.max_features,
                                           random_state=777)
    
    
    def get_params(self, deep=True):
        return {"n_estimators": self.n_estimators,
                "max_depth": self.max_depth,
                "min_samples_split": self.min_samples_split,
                "min_samples_leaf": self.min_samples_leaf,
                "max_features": self.max_features}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

I mainly followed the official Scikit-Learn guide on developing custom estimators, which can be found at https://scikit-learn.org/stable/developers/develop.html

Here is what my grid search looks like:

import pandas as pd
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=RandomForest(),
                           param_grid={'max_depth': [1, 3, 6],
                                       'n_estimators': [10, 100, 300]},
                           n_jobs=-1,
                           scoring='neg_root_mean_squared_error',
                           cv=5, verbose=True).fit(X, y)

print(pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score'))

The grid search progress output and the grid_search.cv_results_ table are printed below:

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.210918      0.002450         0.016754        0.000223   
1       0.207049      0.001675         0.016579        0.000147   
2       0.206495      0.002001         0.016598        0.000158   
3       0.206799      0.002417         0.016740        0.000144   
4       0.207534      0.001603         0.016668        0.000269   
5       0.206384      0.001396         0.016605        0.000136   
6       0.220052      0.024280         0.017247        0.001137   
7       0.226838      0.027507         0.017351        0.000979   
8       0.205738      0.003420         0.016246        0.000626   

  param_max_depth param_n_estimators                                 params  \
0               1                 10   {'max_depth': 1, 'n_estimators': 10}   
1               1                100  {'max_depth': 1, 'n_estimators': 100}   
2               1                300  {'max_depth': 1, 'n_estimators': 300}   
3               3                 10   {'max_depth': 3, 'n_estimators': 10}   
4               3                100  {'max_depth': 3, 'n_estimators': 100}   
5               3                300  {'max_depth': 3, 'n_estimators': 300}   
6               6                 10   {'max_depth': 6, 'n_estimators': 10}   
7               6                100  {'max_depth': 6, 'n_estimators': 100}   
8               6                300  {'max_depth': 6, 'n_estimators': 300}   

   split0_test_score  split1_test_score  split2_test_score  split3_test_score  \
0          -5.246725          -3.200585          -3.326962          -3.209387   
1          -5.246725          -3.200585          -3.326962          -3.209387   
2          -5.246725          -3.200585          -3.326962          -3.209387   
3          -5.246725          -3.200585          -3.326962          -3.209387   
4          -5.246725          -3.200585          -3.326962          -3.209387   
5          -5.246725          -3.200585          -3.326962          -3.209387   
6          -5.246725          -3.200585          -3.326962          -3.209387   
7          -5.246725          -3.200585          -3.326962          -3.209387   
8          -5.246725          -3.200585          -3.326962          -3.209387   

   split4_test_score  mean_test_score  std_test_score  rank_test_score  
0          -2.911422        -3.579016        0.845021                1  
1          -2.911422        -3.579016        0.845021                1  
2          -2.911422        -3.579016        0.845021                1  
3          -2.911422        -3.579016        0.845021                1  
4          -2.911422        -3.579016        0.845021                1  
5          -2.911422        -3.579016        0.845021                1  
6          -2.911422        -3.579016        0.845021                1  
7          -2.911422        -3.579016        0.845021                1  
8          -2.911422        -3.579016        0.845021                1  
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    3.2s finished

My question is: why does the grid search return exactly the same result for every parameter candidate on all the data splits?

My assumption is that the grid search only ever evaluates a single parameter combination (e.g. {'max_depth': 1, 'n_estimators': 10}) across all data splits. If that is the case, why does it happen?

Finally, how can I make the grid search return the correct result for each parameter combination across all data splits?


Solution

  • Your set_params method doesn't actually change the hyperparameters of the RandomForestRegressor instance held in the self.model attribute. Instead, it sets the attributes on your RandomForest instance directly, and those attributes are only read once, in __init__, so updating them afterwards never reaches the actual model. The grid search therefore keeps setting parameters that don't matter, and the model being fit is the same every time. (Similarly, the get_params method reads the RandomForest attributes, which are not the same as the RandomForestRegressor attributes.) A short demonstration of this follows at the end of this answer.

    You should be able to fix most of this by having set_params simply forward to self.model.set_params (and having get_params read the values back from self.model instead of from the wrapper); a sketch of such a fix is given below.

    There's another problem, I think: you instantiate the model attribute in __init__ from the self.<parameter_name> values, whereas the scikit-learn developer guide expects __init__ to do nothing beyond storing its arguments. If you instead constructed the RandomForestRegressor inside fit from the current attribute values, your original get_params/set_params would have worked as written.
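
To make the first point concrete, here is a minimal check (using the RandomForest class exactly as posted) showing that set_params updates the wrapper's attributes while the inner RandomForestRegressor keeps its construction-time values:

rf = RandomForest()
rf.set_params(max_depth=3, n_estimators=10)

print(rf.max_depth)        # 3    -> the wrapper attribute changed...
print(rf.model.max_depth)  # None -> ...but the wrapped regressor did not,
                           #        so every grid candidate fits the same model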
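
And here is one way to implement the suggested fix (a minimal sketch under the assumptions above, not the only valid design), in which both get_params and set_params delegate to the wrapped regressor, so the parameters the grid search sets are the ones the model actually trains with:

from sklearn.ensemble import RandomForestRegressor

class RandomForest(Model):
    # The hyperparameters the wrapper exposes to GridSearchCV.
    _param_names = ("n_estimators", "max_depth", "min_samples_split",
                    "min_samples_leaf", "max_features")

    def __init__(self, n_estimators=100, max_depth=None,
                 min_samples_split=2, min_samples_leaf=1,
                 max_features=None):
        self.model = RandomForestRegressor(n_estimators=n_estimators,
                                           max_depth=max_depth,
                                           min_samples_split=min_samples_split,
                                           min_samples_leaf=min_samples_leaf,
                                           max_features=max_features,
                                           random_state=777)

    def get_params(self, deep=True):
        # Read the current values back from the wrapped regressor, so
        # get_params always reflects what fit() will actually train.
        params = self.model.get_params(deep=False)
        return {name: params[name] for name in self._param_names}

    def set_params(self, **parameters):
        # Forward to the wrapped regressor: this is the call that was
        # missing, and the reason every candidate scored identically.
        self.model.set_params(**parameters)
        return self

With this version each of the nine candidates builds a genuinely different forest, and the per-split scores in cv_results_ should no longer be identical across rows.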