I was using LogisticRegressionCV's .score() method to get an accuracy score for my model. I also used cross_val_score to get an accuracy score with the same CV split (skf), expecting the same score to show up. But alas, they were different, and I'm confused.
I first set up a StratifiedKFold:
skf = StratifiedKFold(n_splits=5,
                      shuffle=True,
                      random_state=708)
After that, I instantiated a LogisticRegressionCV() with skf as the argument for the cv parameter, fitted it, and scored it on the training set:
logreg = LogisticRegressionCV(cv=skf, solver='liblinear')
logreg.fit(X_train_sc, y_train)
logreg.score(X_train_sc, y_train)
This gave me a score of 0.849507735583685, which was accuracy by default. Since this is LogisticRegressionCV, this score is actually the mean accuracy score, right?
Then I used cross_val_score:
cross_val_score(logreg, X_train_sc, y_train, cv=skf).mean()
This gave me a mean accuracy score of 0.8227814439082044.
I'm kind of confused as to why the scores differ, since I thought I was basically doing the same thing.
"[...] is actually the mean accuracy score right?"
No. The score method here is the accuracy of the final classifier (which was retrained on the entire training set, using the optimal value of the regularization strength). By evaluating it on the training set again, you're getting an optimistically biased estimate of future performance.
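As a rough illustration of that point, here is a minimal sketch on synthetic data. The dataset from make_classification, the train/test split, and the names X_test_sc, train_acc, test_acc and manual_train_acc are assumptions made up for this example, not part of the original setup. It checks that .score() is just the accuracy of the refit model's predictions, and prints it next to the accuracy on data the model never saw.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data (assumption: any binary problem works here).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=708)
logreg = LogisticRegressionCV(cv=skf, solver="liblinear")
logreg.fit(X_train_sc, y_train)

# .score() evaluates the final, refit classifier on whatever data is passed in.
train_acc = logreg.score(X_train_sc, y_train)
manual_train_acc = accuracy_score(y_train, logreg.predict(X_train_sc))

# Accuracy on data that played no part in fitting or in choosing C.
test_acc = logreg.score(X_test_sc, y_test)

print("training-set accuracy:", train_acc)
print("held-out accuracy:    ", test_acc)
```

The training-set number comes from data the model was fitted on, so it is the held-out number that is the fairer estimate of future performance.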
To recover the cross-validation scores, you can use the attribute scores_. Even with the same folds, these may be slightly different from cross_val_score due to randomness in the solver, if it doesn't converge completely.
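A minimal sketch of reading scores_, again on made-up synthetic data (the dataset and the names grid, per_candidate, internal_cv_acc and outer_cv_acc are assumptions for illustration). In the scikit-learn versions I'm familiar with, scores_ for a binary problem is a dict with a single key whose value has shape (n_folds, n_Cs); the code also accepts a plain array and assumes the first axis indexes folds, since the attribute layout may differ between versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data (assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train_sc = StandardScaler().fit_transform(X)
y_train = y

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=708)
logreg = LogisticRegressionCV(cv=skf, solver="liblinear")
logreg.fit(X_train_sc, y_train)

# scores_ holds one accuracy per (fold, candidate C) pair.
raw = logreg.scores_
grid = np.asarray(next(iter(raw.values())) if isinstance(raw, dict) else raw)

# Average over folds to get one cross-validated accuracy per candidate C.
per_candidate = grid.reshape(grid.shape[0], -1).mean(axis=0)
internal_cv_acc = per_candidate.max()

# cross_val_score clones logreg and refits it on each training split of skf,
# so it is computed separately from the scores stored in scores_.
outer_cv_acc = cross_val_score(logreg, X_train_sc, y_train, cv=skf).mean()

print("mean CV accuracy per candidate C:", per_candidate)
print("best mean CV accuracy from scores_:", internal_cv_acc)
print("cross_val_score mean accuracy:     ", outer_cv_acc)
```

Either of these two numbers is the one to compare against the question's cross_val_score result, rather than logreg.score(X_train_sc, y_train).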