python, machine-learning, regression, cross-validation, lightgbm

Light GBM Regression CV Interpreting Results


I've looked at the docs and could not find an answer to my question, hoping someone here knows. Here is some sample code:

N_FOLDS = 5

model = lgb.LGBMClassifier()
default_params = model.get_params()

#overwriting a param
default_params['objective'] = 'regression'

cv_results = lgb.cv(default_params, train_set, num_boost_round = 100000, nfold = N_FOLDS, 
                    early_stopping_rounds = 100, metrics = 'rmse', seed = 50, stratified=False)

I get back a dictionary like this, with two lists of 6 values each:

{'rmse-mean': [635.2078190031074,
  632.0847253839236,
  629.6661071275558,
  627.9721515847672,
  626.6712284533291,
  625.293530527769],
 'rmse-stdv': [197.5088741303537,
  198.66960690389863,
  199.56134068525006,
  200.25929541235243,
  200.8251430042537,
  201.50213772830526]}

At first, I thought the values in that dictionary corresponded to the RMSE for each of the folds (which in this case is 5), but that doesn't seem to be the case. The dictionary looks like it's sorted by descending RMSE.

Does anyone know what each value corresponds to?


Solution

  • It does not correspond to the folds but to the cross-validation result (the mean of the RMSE across all test folds) for each boosting round. You can see this very clearly if we run, say, just 5 rounds and print the results after each round:

    import lightgbm as lgb
    from sklearn.datasets import load_boston  # note: removed in scikit-learn >= 1.2
    X, y = load_boston(return_X_y=True)
    train_set = lgb.Dataset(X, label=y)
    
    N_FOLDS = 5
    params = {'learning_rate': 0.05, 'num_leaves': 4, 'subsample': 0.5}
    
    cv_results = lgb.cv(params, train_set, num_boost_round=5, nfold=N_FOLDS,
                        verbose_eval=True, early_stopping_rounds=None,
                        metrics='rmse', seed=50, stratified=False)
    
    [LightGBM] [Info] Total Bins 1251
    [LightGBM] [Info] Number of data points in the train set: 404, number of used features: 13
    [LightGBM] [Info] Start training from score 22.585149
    [LightGBM] [Info] Start training from score 22.109406
    [LightGBM] [Info] Start training from score 22.579703
    [LightGBM] [Info] Start training from score 22.784158
    [LightGBM] [Info] Start training from score 22.599010
    [1] cv_agg's rmse: 8.86903 + 0.88135
    [2] cv_agg's rmse: 8.58355 + 0.860252
    [3] cv_agg's rmse: 8.31477 + 0.842578
    [4] cv_agg's rmse: 8.06201 + 0.82627
    [5] cv_agg's rmse: 7.8268 + 0.800053
    
    import pandas as pd
    pd.DataFrame(cv_results)
    
        rmse-mean   rmse-stdv
    0   8.869030    0.881350
    1   8.583552    0.860252
    2   8.314774    0.842578
    3   8.062014    0.826270
    4   7.826800    0.800053
    

    In your post, you set early_stopping_rounds = 100 and used the default learning rate of 0.1, which might be a bit high depending on your data, so chances are that it stopped improving after 6 rounds.

    Using the same example above, you can see that if we set early_stopping_rounds = 100, training stops once the metric has not improved for 100 consecutive rounds, and the returned results are truncated at the best iteration (100 rounds before the stop):

    cv_results = lgb.cv(params, train_set, num_boost_round=2000, nfold=N_FOLDS,
                        verbose_eval=True, early_stopping_rounds=100,
                        metrics='rmse', seed=50, stratified=False)
    
    [...]
    [1475]  cv_agg's rmse: 3.20605 + 0.50213
    [1476]  cv_agg's rmse: 3.20616 + 0.501997
    [1477]  cv_agg's rmse: 3.20607 + 0.501998
    [1478]  cv_agg's rmse: 3.20636 + 0.501865
    [1479]  cv_agg's rmse: 3.20631 + 0.501905
    [1480]  cv_agg's rmse: 3.20633 + 0.501731
    [1481]  cv_agg's rmse: 3.20659 + 0.501494
    [1482]  cv_agg's rmse: 3.2068 + 0.502046
    [1483]  cv_agg's rmse: 3.20687 + 0.50213
    [1484]  cv_agg's rmse: 3.20701 + 0.502265
    [1485]  cv_agg's rmse: 3.20717 + 0.502096
    [1486]  cv_agg's rmse: 3.2072 + 0.501779
    [1487]  cv_agg's rmse: 3.20722 + 0.501613
    [1488]  cv_agg's rmse: 3.20718 + 0.501308
    [1489]  cv_agg's rmse: 3.20701 + 0.501232
    
    pd.DataFrame(cv_results).shape
    (1389, 2)
    

    If you want an estimate of the RMSE from your model, take the last value of rmse-mean, which corresponds to the best iteration.
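Concretely, using the numbers from the 5-round run shown earlier as a stand-in for the dictionary returned by lgb.cv:

```python
# Stand-in for the dict returned by lgb.cv (values copied from the 5-round
# example above; lightgbm 3.x key names -- 4.x prefixes them with 'valid ').
cv_results = {'rmse-mean': [8.869030, 8.583552, 8.314774, 8.062014, 7.826800],
              'rmse-stdv': [0.881350, 0.860252, 0.842578, 0.826270, 0.800053]}

best_rmse = cv_results['rmse-mean'][-1]       # RMSE estimate at the best iteration
best_stdv = cv_results['rmse-stdv'][-1]       # its spread across folds
n_best_rounds = len(cv_results['rmse-mean'])  # number of rounds actually kept
print(best_rmse, best_stdv, n_best_rounds)    # 7.8268 0.800053 5
```

The list length also tells you how many boosting rounds to use if you retrain a final model on the full training set.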