Using GridSearchCV, I am trying to maximize AUC for a LogisticRegression classifier:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

clf_log = LogisticRegression(C=1, random_state=0).fit(X_train, y_train)

grid_params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter': [100]}
gs = GridSearchCV(clf_log, grid_params, scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
I got a gs.best_score_ of 0.7630647186779661, with gs.best_estimator_ and gs.best_params_ respectively as follows:
LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
{'C': 10, 'max_iter': 100, 'penalty': 'l2'}
However, when I reintroduced these params into my original clf_log, I only got an AUC of 0.5359918677005525. What am I missing (I suspect the CV part)? How can I replicate the same results? Thanks!
GridSearchCV uses k-fold cross-validation: when you call its fit method, it splits the data into train and test folds (cv=5 means each test fold is 1/5 of the dataset) and repeats this cv times (5 in this case). So you shouldn't pass X_train and y_train; pass X and y instead (assuming you don't want a separate validation set), since the splitting is done internally.
gs.fit(X, y)
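For reference, a minimal self-contained sketch of that workflow (make_classification is only a stand-in for your own X and y, and solver='liblinear' is an assumption so that both the 'l1' and 'l2' penalties in the grid are supported):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data; replace with your own X and y
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

grid_params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter': [100]}

# liblinear supports both the l1 and l2 penalties in the grid
gs = GridSearchCV(LogisticRegression(solver='liblinear', random_state=0),
                  grid_params, scoring='roc_auc', cv=5)
gs.fit(X, y)

print(gs.best_params_)  # best hyper-parameters found over the 5 stratified folds
print(gs.best_score_)   # mean ROC AUC across those folds for the best params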
After this, let's say your best parameters are {'C': 10, 'max_iter': 100, 'penalty': 'l2'}. Now say you want to apply them. If you want to replicate the output of your GridSearchCV, you need to use k-fold cross-validation again (if you use train_test_split instead, your results will vary slightly).
import numpy as np
from sklearn.model_selection import cross_val_score
# cv must match the cv used in GridSearchCV (5 here) to reproduce best_score_
np.average(cross_val_score(LogisticRegression(C=10, max_iter=100, penalty='l2'), X, y, scoring='roc_auc', cv=5))
With this you should get the same AUC. You can refer to this video for more details.
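As a quick sanity check, here is a small sketch (reusing gs, X and y from the snippet above): with the same cv and the same estimator settings, the replicated mean should line up with gs.best_score_.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Rebuild the estimator with the winning parameters; the solver should match the one used in the grid search
best = LogisticRegression(solver='liblinear', random_state=0, **gs.best_params_)
replicated = np.average(cross_val_score(best, X, y, scoring='roc_auc', cv=5))

print(replicated)       # should agree with gs.best_score_ (same folds, same scorer)
print(gs.best_score_)

Also note that gs.best_estimator_ already holds a copy of the model refit on all of X and y with those parameters (with the default refit=True), so you don't have to rebuild it by hand if you just want the final model.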