I am trying to perform logistic regression with StratifiedGroupKFold cross-validation, as shown in the following code:
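import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold

grid = {'C': np.logspace(-3, 3, 7)}  # candidate regularization strengths
grkf_cv = StratifiedGroupKFold(n_splits=10)
id_ls = X_train_df['ID'].to_list()  # one group label per row
log_reg = LogisticRegression(max_iter=100, random_state=42)
logreg_cv = GridSearchCV(log_reg, grid, cv=grkf_cv, scoring='roc_auc')
# X_train_df still contains the ID column here
logreg_cv.fit(X_train_df, y_train_df, groups=id_ls)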
This causes a problem: the model is trained with the group ID as a feature, which is incorrect. My issue is that I need to pass id_ls together with X_train_df (which contains the ID), and I am not sure how the splits would be performed if X_train_df did not contain the ID.
I can drop the ID from X_train_df and then train, but I do not think the splits would then be performed based on groups.
Is there a way around this problem?
In the example in the scikit-learn documentation (found here), you can see that the groups parameter is defined separately, without ever being part of the training dataset.
I assume this is because the groups parameter does not have to be checked against a column: it already contains the group label for each sample, in order.
This makes sense. The function knows that the first row of X_train has the group ID given by the first element of id_ls (which you are passing as the groups parameter), the second row is matched to the second element of the list, and so on. So you can drop the ID column from X_train_df before fitting and still pass id_ls as groups; the splits will still be made by group.