Tags: pandas, scikit-learn, logistic-regression, cross-validation, grid-search

How to perform StratifiedGroupKFold based on ID that should not be part of training?


I am trying to perform logistic regression using StratifiedGroupKFold as shown in the following code.

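import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold

grid = {'C': np.logspace(-3, 3, 7)}
grkf_cv = StratifiedGroupKFold(n_splits=10)
id_ls = X_train_df['ID'].to_list()  # group label for each row

log_reg = LogisticRegression(max_iter=100, random_state=42)
logreg_cv = GridSearchCV(log_reg, grid, cv=grkf_cv, scoring='roc_auc')
# X_train_df still contains the 'ID' column here, so ID is used as a feature
logreg_cv.fit(X_train_df, y_train_df, groups=id_ls)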

This is a problem because X_train_df still contains the ID column, so the model is trained with the group ID as a feature, which is incorrect. My issue is that I need to pass id_ls alongside X_train_df (which contains the ID), and I am not sure how the splits would be performed if X_train_df did not contain the ID.

I can drop the ID from X_train_df and then train, but I do not think the splits would then be performed based on groups.

Is there a way around this problem?


Solution

  • The example in the scikit-learn documentation defines the groups parameter separately, without it ever being part of the training dataset.

    I assume this is because the groups parameter does not need to be matched against a column: it already holds the group label for each sample, in row order.

    This makes sense: the splitter treats the first element of id_ls (which you pass as the groups parameter) as the group ID of the first row of X_train_df, the second element as the group ID of the second row, and so on. So you can drop the ID column from X_train_df before fitting and the splits will still be group-based, as long as id_ls is taken from the same rows in the same order.