
How does best estimator fitting work in RandomizedSearchCV?


I used RandomizedSearchCV (RSCV) with the default 5-fold CV for LGBMClassifier with an evaluation set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model_LGBM = LGBMClassifier(objective='binary', metric='auc', random_state=0, early_stopping_round=100)

distributions = dict(max_depth=range(1, 10),
                     num_leaves=[50, 100, 150],
                     learning_rate=[0.1, 0.2, 0.3],
                     )

clf = RandomizedSearchCV(model_LGBM, distributions, random_state=0, n_iter=100, verbose=10)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)])

So the output of the RSCV looks like:

First iter: CV 1/5 "valid0's", CV 2/5 "valid0's", ..., CV 5/5 "valid0's";
Second iter: CV 1/5 "valid0's", CV 2/5 "valid0's", ..., CV 5/5 "valid0's";
...
Last iter: CV 1/5 "valid0's", CV 2/5 "valid0's", ..., CV 5/5 "valid0's";
+1 fit with "valid0's"

I suppose the last fit is the refitted best estimator. Does it use the whole training set? Where does it use the evaluation set?


Solution

  • According to the docs, if the refit parameter is True (which it is by default), the best estimator is refitted once at the end, using the best hyperparameters found by the search, on the entire dataset passed to fit (here, the full X_train/y_train). That is the "+1 fit" you see after the 5x100 cross-validation fits.
  • Extra keyword arguments to fit, such as eval_set, are forwarded unchanged to every internal fit call, including that final refit. So early stopping in the last fit still monitors (X_test, y_test), which is why it also prints a "valid0's" line.
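
The refit behaviour can be checked with a small sketch. It uses DecisionTreeClassifier instead of LGBMClassifier purely so the example is self-contained (no LightGBM install needed); the RandomizedSearchCV mechanics are the same: the final best_estimator_ is equivalent to a fresh estimator configured with best_params_ and fitted on the whole training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": range(1, 10)},
    n_iter=5, cv=5, refit=True, random_state=0,
)
search.fit(X_train, y_train)

# With refit=True, best_estimator_ was fitted once more on ALL of X_train
# using best_params_. Reproduce that refit manually and compare:
manual = DecisionTreeClassifier(random_state=0, **search.best_params_)
manual.fit(X_train, y_train)

assert search.best_estimator_.get_params() == manual.get_params()
assert (search.best_estimator_.predict(X_test) == manual.predict(X_test)).all()
```

Since the estimator and data are deterministic here, the manual refit reproduces best_estimator_ exactly, confirming the refit uses the whole training set rather than any single CV fold.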