python, machine-learning, regression, cross-validation, lightgbm

Light GBM Regression CV Interpreting Results


I've looked at the docs and could not find an answer to my question, hoping someone here knows. Here is some sample code:

N_FOLDS = 5

model = lgb.LGBMClassifier()
default_params = model.get_params()

#overwriting a param
default_params['objective'] = 'regression'

cv_results = lgb.cv(default_params, train_set, num_boost_round = 100000, nfold = N_FOLDS, 
                    early_stopping_rounds = 100, metrics = 'rmse', seed = 50, stratified=False)

I get back a dictionary like this, with two lists of 6 values each:

{'rmse-mean': [635.2078190031074,
  632.0847253839236,
  629.6661071275558,
  627.9721515847672,
  626.6712284533291,
  625.293530527769],
 'rmse-stdv': [197.5088741303537,
  198.66960690389863,
  199.56134068525006,
  200.25929541235243,
  200.8251430042537,
  201.50213772830526]}

At first, I thought the values in that dictionary corresponded to the RMSE for each of the folds (which in this case is 5), but that doesn't seem to be the case. The dictionary looks like it's sorted by descending RMSE.

Does anyone know what each value corresponds to?


Solution

  • It does not correspond to the folds but to the cross-validation result (the mean of the RMSE across all test folds) for each boosting round. You can see this very clearly if we run, say, just 5 rounds and print the results after each round:

    import lightgbm as lgb
    from sklearn.datasets import load_boston  # note: removed in scikit-learn >= 1.2
    X, y = load_boston(return_X_y=True)
    train_set = lgb.Dataset(X, label=y)
    
    N_FOLDS = 5
    params = {'learning_rate': 0.05, 'num_leaves': 4, 'subsample': 0.5}
    
    cv_results = lgb.cv(params, train_set, num_boost_round=5, nfold=N_FOLDS,
                        verbose_eval=True, early_stopping_rounds=None,
                        metrics='rmse', seed=50, stratified=False)
    
    [LightGBM] [Info] Total Bins 1251
    [LightGBM] [Info] Number of data points in the train set: 404, number of used features: 13
    [LightGBM] [Info] Start training from score 22.585149
    [LightGBM] [Info] Start training from score 22.109406
    [LightGBM] [Info] Start training from score 22.579703
    [LightGBM] [Info] Start training from score 22.784158
    [LightGBM] [Info] Start training from score 22.599010
    [1] cv_agg's rmse: 8.86903 + 0.88135
    [2] cv_agg's rmse: 8.58355 + 0.860252
    [3] cv_agg's rmse: 8.31477 + 0.842578
    [4] cv_agg's rmse: 8.06201 + 0.82627
    [5] cv_agg's rmse: 7.8268 + 0.800053
    
    import pandas as pd
    pd.DataFrame(cv_results)
    
        rmse-mean   rmse-stdv
    0   8.869030    0.881350
    1   8.583552    0.860252
    2   8.314774    0.842578
    3   8.062014    0.826270
    4   7.826800    0.800053
    

    In your post, you set early_stopping_rounds = 100 and used the default learning rate of 0.1, which might be a bit high depending on your data, so chances are that it stopped improving after 6 rounds.

    Using the same example above, you can see that if we set early_stopping_rounds = 100, training stops once the metric has not improved for 100 consecutive rounds, and the returned results are truncated at the best iteration (100 rounds before the stop):

    cv_results = lgb.cv(params, train_set, num_boost_round=2000, nfold=N_FOLDS,
                        verbose_eval=True, early_stopping_rounds=100,
                        metrics='rmse', seed=50, stratified=False)
    
    [...]
    [1475]  cv_agg's rmse: 3.20605 + 0.50213
    [1476]  cv_agg's rmse: 3.20616 + 0.501997
    [1477]  cv_agg's rmse: 3.20607 + 0.501998
    [1478]  cv_agg's rmse: 3.20636 + 0.501865
    [1479]  cv_agg's rmse: 3.20631 + 0.501905
    [1480]  cv_agg's rmse: 3.20633 + 0.501731
    [1481]  cv_agg's rmse: 3.20659 + 0.501494
    [1482]  cv_agg's rmse: 3.2068 + 0.502046
    [1483]  cv_agg's rmse: 3.20687 + 0.50213
    [1484]  cv_agg's rmse: 3.20701 + 0.502265
    [1485]  cv_agg's rmse: 3.20717 + 0.502096
    [1486]  cv_agg's rmse: 3.2072 + 0.501779
    [1487]  cv_agg's rmse: 3.20722 + 0.501613
    [1488]  cv_agg's rmse: 3.20718 + 0.501308
    [1489]  cv_agg's rmse: 3.20701 + 0.501232
    
    pd.DataFrame(cv_results).shape
    (1389, 2)
    

    If you want an estimate of the RMSE from your model, take the last value of rmse-mean, which corresponds to the best iteration.
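Concretely, using the numbers from the 5-round run shown earlier as a stand-in for the dictionary returned by lgb.cv:

```python
# Stand-in for the dict returned by lgb.cv (values copied from the 5-round
# example above; lightgbm 3.x key names -- 4.x prefixes them with 'valid ').
cv_results = {'rmse-mean': [8.869030, 8.583552, 8.314774, 8.062014, 7.826800],
              'rmse-stdv': [0.881350, 0.860252, 0.842578, 0.826270, 0.800053]}

best_rmse = cv_results['rmse-mean'][-1]       # RMSE estimate at the best iteration
best_stdv = cv_results['rmse-stdv'][-1]       # its spread across folds
n_best_rounds = len(cv_results['rmse-mean'])  # number of rounds actually kept
print(best_rmse, best_stdv, n_best_rounds)    # 7.8268 0.800053 5
```

The list length also tells you how many boosting rounds to use if you retrain a final model on the full training set.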