python scikit-learn classification logistic-regression cross-validation

Difference Between scikit-learn's `clf.score()` and `clf.cv_results_`


I have written code for a logistic regression in Python (Anaconda 3.5.2 with sklearn 0.18.2). I have used GridSearchCV() and train_test_split() to search over parameters and split the input data.

My goal is to find the overall (average) accuracy over the 10 folds, with a standard error, on the test data. Additionally, I want to predict the class labels, create a confusion matrix and prepare a classification report summary.


Please, advise me in the following:

(1) Is my code correct? Please check each part.

(2) I have tried two different sklearn functions, clf.score() and clf.cv_results_, and I see that they give different results. Which one is correct? (However, the summaries are not included here.)

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load any n x m data and label column. No missing or NaN values.
# I am skipping the data loading part. One can load any data to test the code below.

sc = StandardScaler()
lr = LogisticRegression()
pipe = Pipeline(steps=[('sc', sc), ('lr', lr)])
parameters = {'lr__C': [0.001, 0.01]}

if __name__ == '__main__':

    clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10, refit=True)

    X_train, X_test, y_train, y_test = train_test_split(Data, labels, random_state=0)

    # Train the classifier on the training features and targets
    clf.fit(X_train, y_train)

    print("Accuracy on training set: {:.2f}% \n".format(clf.score(X_train, y_train) * 100))
    print("Accuracy on test set: {:.2f}%\n".format(clf.score(X_test, y_test) * 100))
    print("Best Parameters: ")
    print(clf.best_params_)

    # Alternately, using cv_results_
    print("Accuracy on training set: {} \n".format(clf.cv_results_['mean_train_score'] * 100))
    print("Accuracy on test set: {}\n".format(clf.cv_results_['mean_test_score'] * 100))

    # Predict class labels
    y_pred = clf.best_estimator_.predict(X_test)

    # Confusion Matrix
    class_names = ['Positive', 'Negative']
    confMatrix = confusion_matrix(y_test, y_pred)
    print(confMatrix)

    # Accuracy Report
    classificationReport = classification_report(y_test, y_pred, target_names=class_names)
    print(classificationReport)

I will appreciate any advice.


Solution


    • First of all, the desired metric, i.e. accuracy, is already the default scorer of LogisticRegression(). Thus, we may omit the scoring='accuracy' parameter of GridSearchCV().

    • Secondly, the method score(X, y) returns the value of the chosen metric for the best_estimator_, i.e. the classifier refit on the whole training set with the best parameter combination found in param_grid. It works this way because you have provided refit=True. Note that clf.score(X, y) == clf.best_estimator_.score(X, y). Thus, it does not print the averaged metric but rather the metric of the single best model.

    • Thirdly, the attribute cv_results_ is a much broader summary, as it includes the result of every fit. However, what it reports for each parameter combination is the result averaged over the folds. These averaged values are the ones you wish to store (see the sketch right below this list).
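
    To make the distinction concrete, here is a minimal sketch (assuming a fitted GridSearchCV object named clf and the X_test/y_test split from the code above) that reads the fold-averaged scores out of cv_results_ and attaches the standard error the question asks for:

    import numpy as np

    # clf.score() is simply the score of the single refit best_estimator_
    print(clf.score(X_test, y_test) == clf.best_estimator_.score(X_test, y_test))  # True

    # cv_results_ has one entry per parameter combination,
    # each one averaged over the 10 folds
    mean_scores = clf.cv_results_['mean_test_score']
    std_scores = clf.cv_results_['std_test_score']
    std_err = std_scores / np.sqrt(10)   # standard error of the mean over 10 folds

    for params, m, se in zip(clf.cv_results_['params'], mean_scores, std_err):
        print("{}: {:.3f} +/- {:.3f}".format(params, m, se))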


    Quick Example

    Let me introduce a toy example for better understanding:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    param_grid = {'C': [0.001, 0.01]}
    clf = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid,
                       cv=10, refit=True)
    clf.fit(X_train, y_train)
    print(clf.best_estimator_.score(X_train, y_train))
    print('____')
    print(clf.cv_results_)


    This code yields the following:

    0.98107957707289928 # which is the best possible accuracy score


    {'mean_fit_time': array([ 0.15465896,  0.23701136]),
     'mean_score_time': array([ 0.0006465 ,  0.00065773]),
     'mean_test_score': array([ 0.934335 ,  0.9376739]),
     'mean_train_score': array([ 0.96475625,  0.98225632]),
     'param_C': masked_array(data = [0.001 0.01], ...
     'params': ({'C': 0.001}, {'C': 0.01}),
     ...}

    mean_train_score has two mean values because we grid over two options for the C parameter.
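
    If you only need the averaged accuracy for the winning C, a small sketch (again assuming the clf fitted above) is to index cv_results_ with best_index_, which points at the row matching best_params_:

    best = clf.best_index_   # row of cv_results_ corresponding to best_params_
    mean_acc = clf.cv_results_['mean_test_score'][best]
    std_acc = clf.cv_results_['std_test_score'][best]
    print("CV accuracy for best C: {:.3f} (+/- {:.3f})".format(mean_acc, std_acc))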

    I hope that helps!