I tried to replicate the result of cross_val_score() when hyper-tuning an XGBoost toy model.
I used code NO.1 to do cross validation, whose result served as a benchmark, and then used code NO.2 and code NO.3 to replicate the CV result by programming the cross-validation loops manually.
The major difference between code NO.2 and code NO.3 is that I put the initialization of the XGBoost classifier inside the for loop in code NO.2 but outside it in code NO.3. I expected that only code NO.2 (the inside-the-loop version) would produce the same result as the automatic cross_val_score. To my surprise, all three versions of the code give the same result.
My question is: shouldn't we clone the model for each validation, as mentioned inside the source code of cross_val_score? And in code NO.3, the trained XGBoost models are not independent across validations, right? Non-independence is not in the spirit of cross validation, is it? So why did I get identical results from all three?
Code NO.1
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}
model = XGBClassifier(**params)
kfold = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)  # shuffle=True so that random_state takes effect
results = cross_val_score(model, X, Y, scoring='accuracy', cv=kfold)  # only a single metric is permitted; the model is cloned for each fold, not reused across folds
print(f'Accuracy: {results.mean()*100:.4f}% ({results.std()*100:.3f})')
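To be explicit about what I mean by "cloned" in the comment above, here is a minimal sketch using sklearn.base.clone; this is my own illustration of what I believe cross_val_score does for each fold, not code copied from its source.
from sklearn.base import clone

fresh_model = clone(model)                    # a new, unfitted estimator carrying the same hyperparameters
print(fresh_model.get_params()['objective'])  # 'binary:logistic'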
Code NO.2
x_train = all_df.drop('Survived', axis=1).iloc[:train_rows].values
y_train = train_label.iloc[:train_rows].values
y_oof = np.zeros(x_train.shape[0])
acc_scores = []
kfold = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)  # shuffle=True so that random_state takes effect
for i, (train_index, valid_index) in enumerate(kfold.split(x_train, y_train)):
    model = XGBClassifier(**params)  # <=======================================
    X_A, X_B = x_train[train_index, :], x_train[valid_index, :]
    y_A, y_B = y_train[train_index], y_train[valid_index]
    model.fit(X_A, y_A, eval_set=[(X_B, y_B)])
    y_oof[valid_index] = model.predict(X_B)
    acc_scores.append(accuracy_score(y_B, y_oof[valid_index]))
Code NO.3
x_train = all_df.drop('Survived', axis=1).iloc[:train_rows].values
y_train = train_label.iloc[:train_rows].values
y_oof = np.zeros(x_train.shape[0])
acc_scores = []
kfold = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)  # shuffle=True so that random_state takes effect
model = XGBClassifier(**params)  # <=======================================
for i, (train_index, valid_index) in enumerate(kfold.split(x_train, y_train)):
    X_A, X_B = x_train[train_index, :], x_train[valid_index, :]
    y_A, y_B = y_train[train_index], y_train[valid_index]
    model.fit(X_A, y_A, eval_set=[(X_B, y_B)])
    y_oof[valid_index] = model.predict(X_B)
    acc_scores.append(accuracy_score(y_B, y_oof[valid_index]))
When you call fit on an XGBClassifier instance (or, ideally, any sklearn-compatible estimator), the learning starts over from scratch, so the models are indeed independent across validations.
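As a quick check (my own sketch, not from the question, using a synthetic dataset from make_classification): refitting the same XGBClassifier instance on a second fold should give the same predictions as a freshly constructed model fitted only on that fold, at least with the default deterministic settings.
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=200, random_state=0)
fold_A, fold_B = slice(0, 100), slice(100, 200)

reused = XGBClassifier(objective='binary:logistic', eval_metric='auc')
reused.fit(X[fold_A], y[fold_A])   # first fit
reused.fit(X[fold_B], y[fold_B])   # second fit starts over from scratch

fresh = XGBClassifier(objective='binary:logistic', eval_metric='auc')
fresh.fit(X[fold_B], y[fold_B])    # fitted only once, on the same data

# with the default (deterministic) settings these should match exactly
assert np.array_equal(reused.predict(X), fresh.predict(X))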
Of course, re-initializing or cloning the model is slightly safer, especially if you're unsure whether the implementation keeps any information lying around to reuse. cross_val_score is a wrapper around cross_validate, and there the cloning is actually needed when return_estimator=True, so that the separate fitted copies of the model can be saved and returned.
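For completeness, here is a sketch of that cross_validate path (again my own example on a synthetic dataset, not code from the question): with return_estimator=True you get back one fitted clone per fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, random_state=0)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = cross_validate(
    XGBClassifier(objective='binary:logistic', eval_metric='auc'),
    X, y,
    scoring='accuracy',
    cv=kfold,
    return_estimator=True,   # keep the fitted clone from every fold
)
print(cv_results['test_score'])         # one accuracy per fold
fold_models = cv_results['estimator']   # five distinct fitted XGBClassifier objects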