I am trying to implement nested cross-validation in a more manual way than putting a GridSearchCV inside a cross_val_score, so that I have more control over what I am doing, can account for data leakage, can see how my variables and hyperparameters change along the way (and keep a record of them), and can understand more clearly how everything works.
I would like to ask if this implementation below looks good to you.
import time
from collections import Counter

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# cat_var, num_cat_vars and num_cont are the lists of categorical,
# numeric-categorical and continuous column names, defined elsewhere
cat_pipe = Pipeline([("encoder", OneHotEncoder(drop="first"))])
num_pipe = Pipeline([("encoder", OneHotEncoder(drop="first"))])
num_cont_pipe = Pipeline([("scaler", StandardScaler())])
preprocessor = ColumnTransformer(transformers=[("cat", cat_pipe, cat_var),
                                               ("num", num_pipe, num_cat_vars),
                                               ("num_cont", num_cont_pipe, num_cont)],
                                 n_jobs=-1)  # , sparse_threshold=0

# shuffle=True is needed when a random_state is set
cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=0)

estimator = RandomForestClassifier(class_weight="balanced")
model = XGBClassifier(eval_metric="logloss")
rfe = RFECV(estimator, cv=cv_inner, scoring="roc_auc", n_jobs=-1)

outer_results = []
outer_params = []
for train_ix, test_ix in cv_outer.split(X, y):
    time_start_outer = time.time()
    X_train = X.iloc[train_ix, :]
    X_test = X.iloc[test_ix, :]
    y_train = y.iloc[train_ix]
    y_test = y.iloc[test_ix]
    # Data Prep
    X_train_enc = preprocessor.fit_transform(X_train)
    X_test_enc = preprocessor.transform(X_test)
    # Feature Selection
    X_train_enc_var = rfe.fit(X_train_enc, y_train)
    print(f"The number of features selected is {X_train_enc_var.n_features_}", sep="/n")
    X_train_enc_var = rfe.transform(X_train_enc)
    X_test_enc_var = rfe.transform(X_test_enc)
    # Hyperparameter tuning
    counter = Counter(y_train)
    weight = counter[0] / counter[1]
    hyper_params = {
        # XGBClassifier
        "scale_pos_weight": [1, 2, 3, 4, 5, 6, weight, 7, 8, 9, 10, 11, 12]
    }
    grid_search = GridSearchCV(model, cv=cv_inner, param_grid=hyper_params, n_jobs=-1, scoring="roc_auc")
    X_train_enc_var_hyp = grid_search.fit(X_train_enc_var, y_train)
    best_model = X_train_enc_var_hyp.best_estimator_
    yhat = best_model.predict(X_test_enc_var)
    # evaluate the model
    score = roc_auc_score(y_test, yhat)
    # store the result
    outer_results.append(score)
    outer_params.append(X_train_enc_var_hyp.best_params_)
    # report progress
    print(f' test_score= {score}, validation_score= {X_train_enc_var_hyp.best_score_}')

# summarize the estimated performance of the model
print(f'best_score: {np.mean(outer_results)}, {np.std(outer_results)}')
I noticed three things.
First, not important:
# Feature Selection
X_train_enc_var = rfe.fit(X_train_enc,y_train)
print(f"The number of features selected is {X_train_enc_var.n_features_}", sep="/n")
X_train_enc_var = rfe.transform(X_train_enc)
X_test_enc_var = rfe.transform(X_test_enc)
You save the fitted rfe out to X_train_enc_var, and then two lines later overwrite it with the transformed dataset. This all works out the way you want, but maybe, to be more honest to the variable names:
X_train_enc_var = rfe.fit_transform(X_train_enc,y_train)
X_test_enc_var = rfe.transform(X_test_enc)
print(f"The number of features selected is {rfe.n_features_}", sep="/n")
Second, you use the same cv_inner for, separately, the recursive feature elimination and the grid search. This means the number of features selected has information from the (inner) test folds, and so the grid search scores may be optimistically biased. That's probably not a big deal, because you only care about relative scores for the search. But maybe some hyperparameter combination would do a better job with a different number of features, and so gets penalized.
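One way around that, if you can afford the extra computation, would be to refit the feature elimination inside every grid-search fold by wrapping RFECV and the classifier in a single Pipeline. This is only a sketch using your variable names; cv_fs is a hypothetical extra splitter used just for the elimination step, and the parameter grid is illustrative:

# Sketch: RFECV is refit on each inner training fold, so the number of
# selected features no longer uses information from the inner test folds.
cv_fs = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # hypothetical splitter for RFECV only
inner_pipe = Pipeline([
    ("rfe", RFECV(RandomForestClassifier(class_weight="balanced"),
                  cv=cv_fs, scoring="roc_auc", n_jobs=-1)),
    ("clf", XGBClassifier(eval_metric="logloss")),
])
hyper_params = {"clf__scale_pos_weight": [1, 2, 3, weight, 5, 10]}  # illustrative grid
grid_search = GridSearchCV(inner_pipe, param_grid=hyper_params,
                           cv=cv_inner, scoring="roc_auc", n_jobs=-1)
grid_search.fit(X_train_enc, y_train)  # replaces the separate rfe.fit and grid_search.fit

The trade-off is cost: the elimination is rerun for every hyperparameter candidate and fold, and the resulting best_estimator_ then expects the unselected X_test_enc as input.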
Finally, most serious but also easiest to fix: you call best_model.predict(...), but since you're taking the roc_auc_score, you will want instead best_model.predict_proba(...)[:, 1] for the probability scores (for the positive class).
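Inside your loop that would be just:

yhat = best_model.predict_proba(X_test_enc_var)[:, 1]  # probability of the positive class
score = roc_auc_score(y_test, yhat)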
One other suggestion: since you're going through a lot of computation, save some more information: the rfe's grid_scores_ and the grid search's cv_results_, for example.
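For example, with two extra (hypothetical) lists created before the outer loop; note that recent scikit-learn versions expose RFECV.cv_results_ rather than grid_scores_:

# before the outer loop
outer_rfe_scores = []
outer_cv_results = []

# inside the outer loop, after fitting rfe and grid_search
outer_rfe_scores.append(rfe.grid_scores_)        # rfe.cv_results_ in scikit-learn >= 1.0
outer_cv_results.append(grid_search.cv_results_)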