
Why doesn't GridSearchCV give the C with the highest AUC when scoring roc_auc in logistic regression?


I'm new to this so apologies if this is obvious.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

# liblinear is a solver that supports the l1 penalty
lr = LogisticRegression(penalty='l1', solver='liblinear')
parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clf = GridSearchCV(lr, parameters, scoring='roc_auc', cv=5)
clf.fit(X, Y)
print(clf.score(X, Y))
tn, fp, fn, tp = metrics.confusion_matrix(Y, clf.predict(X)).ravel()
print(tn, fp, fn, tp)

I want to run a logistic regression - I'm using an L1 penalty because I want to reduce the number of features I use, and I'm using GridSearchCV to find the best C value for the logistic regression.

I run this and get C = 0.001, AUC = 0.59, Confusion matrix: 46, 0, 35, 0. Only 1 feature has a non-zero coefficient. I go back to my code and remove the option of C = 0.001 from my parameter list and run it again. Now I get C = 1, AUC = 0.95, Confusion matrix: 42, 4, 6, 29. Many, but not all, features have a non-zero coefficient.

Since I set scoring to 'roc_auc', shouldn't the grid search select the model with the better AUC?

Thinking this might be down to the l1 penalty, I switched it to l2. But this gave C = 0.001, AUC = 0.80, CM = 42, 4, 16, 19, and again, when I removed C = 0.001 as an option, it gave C = 0.01, AUC = 0.88, CM = 41, 5, 13, 22.

The issue is smaller with the l2 penalty, but with l1 the difference is pretty big. Is it a penalty thing?

From some of my reading I know ElasticNet is supposed to combine l1 and l2 - is that where I should be looking?

Also, not completely relevant, but while I'm posting - I haven't done any data normalization here. Is that normal for logistic regression?


Solution

  • clf.score(X, Y) is the score on the training dataset (after choosing the best parameters, the grid search refits the model on the entire dataset), so you don't want to use it to evaluate your model. It also isn't what the grid search uses internally for model selection; instead, it scores the held-out cross-validation folds and takes the average. You can access the score actually used in the model selection with clf.best_score_.
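
    As a quick illustration (continuing the snippet from the question, so clf is the already-fitted GridSearchCV), you can compare the training-set score with the cross-validated scores the grid search actually used:

    print(clf.best_params_)   # the C that won the grid search
    print(clf.best_score_)    # mean cross-validated AUC for that C

    # mean cross-validated AUC for every C that was tried
    for C, auc in zip(clf.cv_results_['param_C'], clf.cv_results_['mean_test_score']):
        print(C, auc)

    If C = 0.001 still has the highest mean_test_score here, the grid search is behaving as documented; the much better numbers you see for larger C come from scoring the model on the same data it was trained on.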