
Light GBM Value Error: ValueError: For early stopping, at least one dataset and eval metric is required for evaluation


Here is my code. It is a binary classification problem and the evaluation metric is the AUC score. I found a solution on Stack Overflow and implemented it, but it did not work and I still get an error.

param_grid = {
    'n_estimators': [1000, 10000],
    'boosting_type': ['gbdt'],
    'num_leaves': [30, 35],
    #'learning_rate': [0.01, 0.02, 0.05],
    #'colsample_bytree': [0.8, 0.95],
    'subsample': [0.8, 0.95],
    'is_unbalance': [True, False],
    #'reg_alpha': [0.01, 0.02, 0.05],
    #'reg_lambda': [0.01, 0.02, 0.05],
    'min_split_gain': [0.01, 0.02, 0.05]
}

lgb = LGBMClassifier(random_state=42, early_stopping_rounds=10, eval_metric='auc', verbose_eval=20)


grid_search = GridSearchCV(lgb, param_grid= param_grid,
                            scoring='roc_auc', cv=5, n_jobs=-1, verbose=1)

grid_search.fit(X_train, y_train, eval_set = (X_val, y_val))

best_model = grid_search.best_estimator_
start = time()
best_model.fit(X_train, y_train)
Train_time = round(time() - start, 4)

The error happens at best_model.fit(X_train, y_train).


Solution

  • Answer

    This error occurs because you used early stopping during the grid search but did not use it when fitting the best model on the full training data.

    Some keyword arguments you pass into LGBMClassifier, including early_stopping_rounds, are stored on the model object and applied again the next time it is fit. So when best_model.fit(X_train, y_train) is called without an eval_set, LightGBM has no dataset to evaluate for early stopping and raises this ValueError.

    To disable early stopping, you can use set_params().

    best_model = grid_search.best_estimator_
    
    # ---------------- my added code -----------------------#
    # inspect current parameters
    params = best_model.get_params()
    print(params)
    
    # remove early_stopping_rounds
    params["early_stopping_rounds"] = None
    best_model.set_params(**params)
    # ------------------------------------------------------#
    
    best_model.fit(X_train, y_train)
    
    

    More Details

    I made some assumptions to turn your question into a minimal reproducible example. In the future, I recommend doing that when you ask questions here. It will help you get better, faster help.

    I installed lightgbm 3.1.0 with pip install lightgbm==3.1.0. I'm using Python 3.8.3 on macOS.

    Things I changed from your example to make it an easier-to-use reproduction:

    • removed commented code
    • cut the number of iterations to [10, 100] and num_leaves to [8, 10] so training would run much faster
    • added imports
    • added a specific dataset and code to produce it repeatably

    reproducible example

    from lightgbm import LGBMClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, train_test_split
    
param_grid = {
    'n_estimators': [10, 100],
    'boosting_type': ['gbdt'],
    'num_leaves': [8, 10],
    'subsample': [0.8, 0.95],
    'is_unbalance': [True, False],
    'min_split_gain': [0.01, 0.02, 0.05]
}

lgb = LGBMClassifier(
    random_state=42,
    early_stopping_rounds=10,
    eval_metric='auc',
    verbose_eval=20
)
    
    grid_search = GridSearchCV(
        lgb,
    param_grid=param_grid,
        scoring='roc_auc',
        cv=5,
        n_jobs=-1,
        verbose=1
    )
    
    X, y = load_breast_cancer(return_X_y=True)
    
    
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.1,
        random_state=42
    )
                                     
grid_search.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)]
)
    
    best_model = grid_search.best_estimator_
    
    # ---------------- my added code -----------------------#
    # inspect current parameters
    params = best_model.get_params()
    print(params)
    
    # remove early_stopping_rounds
    params["early_stopping_rounds"] = None
    best_model.set_params(**params)
    # ------------------------------------------------------#
    
    best_model.fit(X_train, y_train)