Tags: python, machine-learning, scikit-learn, cross-validation, feature-selection

Applying Pipeline and nested cross-validation partly manually with RFECV and GridSearchCV


I am trying to implement nested cross-validation somewhat more manually than by wrapping GridSearchCV inside cross_val_score, so that I have more control over what is happening: I can account for data leakage, see how my variables and hyperparameters change along the way, keep a record of them, and understand more clearly how everything works.

I would like to ask whether the implementation below looks good to you.

Preprocessing Pipelines (to avoid data leakage)

Categorical Variables

cat_pipe = Pipeline([("encoder",OneHotEncoder(drop="first"))])

Numerical features that could be categorical

num_pipe = Pipeline([("encoder",OneHotEncoder(drop="first"))])

Continuous variables

num_cont_pipe = Pipeline([("scaler",StandardScaler())])

preprocessor = ColumnTransformer(transformers=[('cat', cat_pipe, cat_var),
                                               ('num', num_pipe, num_cat_vars),
                                               ('num_cont', num_cont_pipe, num_cont)],
                                 n_jobs=-1)  # sparse_threshold=0

Nested Cross Validation

cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # or RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=0)

Estimator and model

estimator = RandomForestClassifier(class_weight="balanced")
model = XGBClassifier(eval_metric="logloss")

Feature Selection and Hyperparameter Optimization

rfe = RFECV(estimator,cv=cv_inner, scoring="roc_auc",n_jobs=-1)

Implementation

outer_results = []
outer_params = []

for train_ix, test_ix in cv_outer.split(X, y):
    time_start_outer = time.time()

    X_train = X.iloc[train_ix,:]
    X_test = X.iloc[test_ix,:]
    y_train = y.iloc[train_ix]
    y_test = y.iloc[test_ix]


    # Data Prep
    X_train_enc = preprocessor.fit_transform(X_train)
    X_test_enc = preprocessor.transform(X_test)

        
    # Feature Selection
    X_train_enc_var = rfe.fit(X_train_enc,y_train)
    print(f"The number of features selected is {X_train_enc_var.n_features_}", sep="/n")
    X_train_enc_var = rfe.transform(X_train_enc)
    X_test_enc_var = rfe.transform(X_test_enc)

    # Hyperparameter tuning 
    counter = Counter(y_train)
    weight = counter[0] / counter[1]
    hyper_params = {
        # XGBClassifier
        "scale_pos_weight":[1,2,3,4,5,6,weight,7,8,9,10,11,12]
        }

    grid_search = GridSearchCV(model,cv=cv_inner,param_grid=hyper_params,n_jobs=-1,scoring="roc_auc")
    X_train_enc_var_hyp = grid_search.fit(X_train_enc_var,y_train)

    best_model = X_train_enc_var_hyp.best_estimator_

    yhat = best_model.predict(X_test_enc_var)
    # evaluate the model
    score = roc_auc_score(y_test, yhat)

    # store the result
    outer_results.append(score)
    outer_params.append(X_train_enc_var_hyp.best_params_)

    # report progress
    print(f' test_score= {score}, validation_score= {X_train_enc_var_hyp.best_score_}')


# summarize the estimated performance of the model
print(f'best_score: {np.mean(outer_results)}, {np.std(outer_results)}')

Solution

  • I noticed three things.

    First, not important:

        # Feature Selection
        X_train_enc_var = rfe.fit(X_train_enc,y_train)
        print(f"The number of features selected is {X_train_enc_var.n_features_}", sep="/n")
        X_train_enc_var = rfe.transform(X_train_enc)
        X_test_enc_var = rfe.transform(X_test_enc)
    

    You assign the fitted rfe to X_train_enc_var, and then two lines later overwrite it with the transformed dataset. This works out the way you want, but to keep the variable names honest you could write:

        X_train_enc_var = rfe.fit_transform(X_train_enc,y_train)
        X_test_enc_var = rfe.transform(X_test_enc)
        print(f"The number of features selected is {rfe.n_features_}", sep="/n")
    

    Second, you use the same cv_inner for both the recursive feature elimination and the grid search. This means the number of features selected carries information from the (inner) test folds, so the grid-search scores may be optimistically biased. That's probably not a big deal, because you only care about relative scores for the search, but some hyperparameter combination might do a better job with a different number of features and so gets penalized.
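
    If you want to avoid that, one option (just a rough sketch, reusing estimator, model, hyper_params and cv_inner from your code) is to chain the RFECV and the classifier into a single Pipeline and hand that to GridSearchCV, so the feature selection is re-fitted on each inner training fold and never sees the inner test folds:

        from sklearn.pipeline import Pipeline
        from sklearn.feature_selection import RFECV
        from sklearn.model_selection import GridSearchCV

        # feature selection + classifier as one estimator; RFECV gets its own
        # internal CV (5-fold by default) on whatever data GridSearchCV passes it
        inner_pipe = Pipeline([
            ("rfe", RFECV(estimator, scoring="roc_auc")),
            ("clf", model),
        ])

        # note the step prefix on the parameter name
        pipe_params = {"clf__scale_pos_weight": hyper_params["scale_pos_weight"]}

        grid_search = GridSearchCV(inner_pipe, param_grid=pipe_params,
                                   cv=cv_inner, scoring="roc_auc", n_jobs=-1)
        grid_search.fit(X_train_enc, y_train)

    This is more expensive, since the feature elimination is repeated for every candidate and every inner split, but it removes the shared-fold issue entirely.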

    Finally, most serious but also easiest to fix: you call best_model.predict(...), but since you're computing roc_auc_score you want best_model.predict_proba(...)[:, 1] instead, i.e. the predicted probability of the positive class.
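
    As a minimal sketch of that change, with the variable names from your loop:

        # roc_auc_score expects scores/probabilities, not hard class labels
        yhat = best_model.predict_proba(X_test_enc_var)[:, 1]
        score = roc_auc_score(y_test, yhat)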

    One other suggestion: since you're going through a lot of computation anyway, save some more of the information it produces, for example the rfe's grid_scores_ and the grid search's cv_results_.
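
    For instance, with two extra lists created before the outer loop (outer_rfe_scores and outer_cv_results are just names made up here), you could add inside the loop:

        # per-fold diagnostics, cheap to keep once the work is already done
        outer_rfe_scores.append(rfe.grid_scores_)         # mean CV score per number of features
        outer_cv_results.append(grid_search.cv_results_)  # full grid-search results table

    (On newer scikit-learn versions, rfe.cv_results_ replaces the deprecated grid_scores_.)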