machine-learning · scikit-learn · grid-search · gridsearchcv

After hyperparameter tuning accuracy remains the same


I was trying to tune the hyperparameters, but after I did it the accuracy score did not change at all. What am I doing wrong?

# Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

logreg = LogisticRegression(C=0.3326530612244898, max_iter=100, tol=0.01)
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('Accuracy of log reg is: ', logreg.score(X_test, y_test))

confusion_matrix(y_test, y_pred)
# 0.9181286549707602 - accuracy before tuning

Output:

Accuracy of log reg is:  0.9181286549707602
array([[ 54,   9],
       [  5, 103]])

Here is my GridSearchCV code:

import numpy as np
from sklearn.model_selection import GridSearchCV

params = {'tol': [0.01, 0.001, 0.0001],
          'max_iter': [100, 150, 200],
          'C': np.linspace(1, 20) / 10}

grid_model = GridSearchCV(logreg, param_grid=params, cv=5)
grid_model_result = grid_model.fit(X_train, y_train)
print(grid_model_result.best_score_, grid_model_result.best_params_)

Output:

0.8867405063291139 {'C': 0.3326530612244898, 'max_iter': 100, 'tol': 0.01}

Solution

  • The problem is that in the first chunk you evaluate the model's performance on the test set, while the best_score_ reported by GridSearchCV is the mean cross-validated accuracy on the training set after hyperparameter optimization, so the two numbers are not directly comparable.

    The code below shows that both procedures, when used to predict the test set labels, perform equally well in terms of accuracy (~0.93); a short check making the train/test distinction explicit follows after the code.

    Note, you might want to consider a hyperparameter grid with other solvers and a larger range of max_iter, because I obtained convergence warnings; an illustrative wider grid is sketched after the code.

    # Load packages
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics
    
    # Load the dataset and split in X and y
    df = pd.read_csv('Breast_cancer_data.csv')
    X = df.iloc[:, 0:5]
    y = df.iloc[:, 5]
    
    # Perform train and test split (80/20)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize a model
    Log = LogisticRegression(n_jobs=-1)
    
    # Initialize a parameter grid
    params = [{'tol': [0.01, 0.001, 0.0001],
               'max_iter': [100, 150, 200],
               'C': np.linspace(1, 20) / 10}]
    
    # Perform GridSearchCV and store the best parameters
    grid_model = GridSearchCV(Log, param_grid=params, cv=5)
    grid_model_result = grid_model.fit(X_train, y_train)
    best_param = grid_model_result.best_params_
    
    # This step is only to prove that both procedures actually result in the same accuracy score
    Log2 = LogisticRegression(C=best_param['C'], max_iter=best_param['max_iter'], tol=best_param['tol'], n_jobs=-1)
    Log2.fit(X_train, y_train)
    
    # Perform two predictions: one straight from the grid search, the other with the best params entered manually
    y_pred1 = grid_model_result.best_estimator_.predict(X_test)
    y_pred2 = Log2.predict(X_test)
    
    # Compare the accuracy scores and see that both are the same
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred1))
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred2))