
Light GBM Value Error: ValueError: For early stopping, at least one dataset and eval metric is required for evaluation


Here is my code. It is a binary classification problem and the evaluation metric is the AUC score. I found a solution on Stack Overflow and implemented it, but it did not work and I still get an error.

param_grid = {
    'n_estimators': [1000, 10000],
    'boosting_type': ['gbdt'],
    'num_leaves': [30, 35],
    #'learning_rate': [0.01, 0.02, 0.05],
    #'colsample_bytree': [0.8, 0.95],
    'subsample': [0.8, 0.95],
    'is_unbalance': [True, False],
    #'reg_alpha': [0.01, 0.02, 0.05],
    #'reg_lambda': [0.01, 0.02, 0.05],
    'min_split_gain': [0.01, 0.02, 0.05]
}

lgb = LGBMClassifier(random_state=42, early_stopping_rounds=10, eval_metric='auc', verbose_eval=20)


grid_search = GridSearchCV(lgb, param_grid= param_grid,
                            scoring='roc_auc', cv=5, n_jobs=-1, verbose=1)

grid_search.fit(X_train, y_train, eval_set = (X_val, y_val))

best_model = grid_search.best_estimator_
start = time()
best_model.fit(X_train, y_train)
Train_time = round(time() - start, 4)

The error happens at best_model.fit(X_train, y_train).


Solution

  • Answer

    This error occurs because you used early stopping during the grid search but did not use it when fitting the best model on the full training data.

    Some keyword arguments you pass into LGBMClassifier, including early_stopping_rounds, are stored on the model object and applied again the next time it is fit. So when best_model.fit(X_train, y_train) is called without an eval_set, LightGBM has no dataset to evaluate for early stopping and raises this ValueError.

    To disable early stopping, you can use set_params().

    best_model = grid_search.best_estimator_
    
    # ---------------- my added code -----------------------#
    # inspect current parameters
    params = best_model.get_params()
    print(params)
    
    # remove early_stopping_rounds
    params["early_stopping_rounds"] = None
    best_model.set_params(**params)
    # ------------------------------------------------------#
    
    best_model.fit(X_train, y_train)
    
    

    More Details

    I made some assumptions to turn your question into a minimal reproducible example. In the future, I recommend doing that when you ask questions here. It will help you get better, faster help.

    I installed lightgbm 3.1.0 with pip install lightgbm==3.1.0. I'm using Python 3.8.3 on macOS.

    Things I changed from your example to make it an easier-to-use reproduction:

    • removed commented code
    • cut the number of iterations to [10, 100] and num_leaves to [8, 10] so training would run much faster
    • added imports
    • added a specific dataset and code to produce it repeatably

    reproducible example

    from lightgbm import LGBMClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, train_test_split
    
param_grid = {
    'n_estimators': [10, 100],
    'boosting_type': ['gbdt'],
    'num_leaves': [8, 10],
    'subsample': [0.8, 0.95],
    'is_unbalance': [True, False],
    'min_split_gain': [0.01, 0.02, 0.05]
}

lgb = LGBMClassifier(
    random_state=42,
    early_stopping_rounds=10,
    eval_metric='auc',
    verbose_eval=20
)
    
    grid_search = GridSearchCV(
        lgb,
    param_grid=param_grid,
        scoring='roc_auc',
        cv=5,
        n_jobs=-1,
        verbose=1
    )
    
    X, y = load_breast_cancer(return_X_y=True)
    
    
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.1,
        random_state=42
    )
                                     
grid_search.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)]
)
    
    best_model = grid_search.best_estimator_
    
    # ---------------- my added code -----------------------#
    # inspect current parameters
    params = best_model.get_params()
    print(params)
    
    # remove early_stopping_rounds
    params["early_stopping_rounds"] = None
    best_model.set_params(**params)
    # ------------------------------------------------------#
    
    best_model.fit(X_train, y_train)