python, machine-learning, scikit-learn, cross-validation, lightgbm

Why can't I match LGBM's cv score?


I'm unable to match LGBM's cv score by hand.

Here's an MCVE:

from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

folds = KFold(5, shuffle=True, random_state=42)  # random_state only takes effect when shuffle=True

params = {'random_state': 42}

results = lgb.cv(params, lgb.Dataset(X_train, y_train), folds=folds, num_boost_round=1000, early_stopping_rounds=100, metrics=['auc'])
print('LGBM\'s cv score: ', results['auc-mean'][-1])

clf = lgb.LGBMClassifier(**params, n_estimators=len(results['auc-mean']))

val_scores = []
for train_idx, val_idx in folds.split(X_train):
    clf.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    val_scores.append(roc_auc_score(y_train.iloc[val_idx], clf.predict_proba(X_train.iloc[val_idx])[:,1]))
print('Manual score: ', np.mean(np.array(val_scores)))

I was expecting the two CV scores to be identical: I set random seeds and did exactly the same thing in both cases. Yet they differ.

Here's the output I get:

LGBM's cv score:  0.9851513530737058
Manual score:  0.9903622177441328

Why? Am I not using LGBM's cv module correctly?


Solution

  • You are splitting X into X_train and X_test. For cv you split X_train into 5 folds, while manually you split X into 5 folds; i.e., you use more points manually than with cv.

    Change lgb.Dataset(X_train, y_train) to lgb.Dataset(X, y) in the lgb.cv call, so that both approaches see the same data.

    Furthermore, other parameters can differ between the two runs. For example, the number of threads used by LightGBM changes the result. During cv the models are fitted in parallel, so the number of threads used might differ from your manual sequential training; see the sketch below.
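
    To rule this out, pin the thread count and the seed explicitly in the params dict you pass to both lgb.cv and your manual training. A minimal sketch (num_threads and seed are standard LightGBM parameters; the values are only illustrative):

    params = {
            'objective': 'binary',
            'metric': 'auc',
            'num_threads': 1,  # single-threaded training, so parallelism cannot change results
            'seed': 42,        # LightGBM's master seed, from which its other seeds are derived
            }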

    EDIT after 1st correction:

    You can achieve the same results with manual splitting and with cv using this code:

    from sklearn.datasets import load_breast_cancer
    import pandas as pd
    from sklearn.model_selection import train_test_split, KFold
    from sklearn.metrics import roc_auc_score
    import lightgbm as lgb
    import numpy as np
    
    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
    folds = KFold(5, shuffle=True, random_state=42)  # random_state only takes effect when shuffle=True
    
    
    params = {
            'task': 'train',
            'boosting_type': 'gbdt',
            'objective':'binary',
            'metric':'auc',
            }
    
    data_all = lgb.Dataset(X_train, y_train)  # bin boundaries will be computed from the full training data
    
    results = lgb.cv(params, data_all, 
                     folds=folds.split(X_train), 
                     num_boost_round=1000, 
                     early_stopping_rounds=100)
    
    print('LGBM\'s cv score: ', results['auc-mean'][-1])
    
    val_scores = []
    for train_idx, val_idx in folds.split(X_train):
    
        data_trd = lgb.Dataset(X_train.iloc[train_idx],
                               y_train.iloc[train_idx],
                               reference=data_all)  # reuse data_all's bin boundaries
    
        gbm = lgb.train(params,
                        data_trd,
                        num_boost_round=len(results['auc-mean']),
                        verbose_eval=100)
    
        val_scores.append(roc_auc_score(y_train.iloc[val_idx], gbm.predict(X_train.iloc[val_idx])))
    print('Manual score: ', np.mean(np.array(val_scores)))
    

    yields

    LGBM's cv score:  0.9914524426410262
    Manual score:  0.9914524426410262
    

    What makes the difference is the line reference=data_all. During cv, the binning of the variables (see the LightGBM Dataset documentation) is constructed from the whole dataset (X_train), while in your manual for loop it was built from the training subset (X_train.iloc[train_idx]). By passing a reference to the Dataset containing all the data, LightGBM reuses the same binning and produces the same results.
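
    As a minimal sketch of the distinction, using the variables from the code above:

    # bins recomputed from this fold's rows alone: what the manual loop
    # did before the fix, and why its scores drifted from lgb.cv's
    data_trd = lgb.Dataset(X_train.iloc[train_idx], y_train.iloc[train_idx])

    # bins copied from data_all, matching what lgb.cv effectively does
    # when it subsets the full Dataset for each fold
    data_trd = lgb.Dataset(X_train.iloc[train_idx], y_train.iloc[train_idx],
                           reference=data_all)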