Tags: python, machine-learning, scikit-learn, cross-validation, grid-search

GridSearchCV with separate training & validation sets erroneously also takes the training results into account when choosing the best model


I have a dataset of 3500 observations x 70 features which is my training set, and a dataset of 600 observations x 70 features which is my validation set. The goal is to classify each observation correctly as either 0 or 1.

I use XGBoost and aim for the highest possible precision at a classification threshold of 0.5.

I am conducting a grid search:

import numpy as np
import pandas as pd
import xgboost

# Import datasets from edge node
data_train = pd.read_csv('data.csv')
data_valid = pd.read_csv('data_valid.csv')
 
# Specify 'data_valid' as the validation set for the grid search below
from sklearn.model_selection import PredefinedSplit
X, y, train_valid_indices = train_valid_merge(data_train, data_valid)
train_valid_merge_indices = PredefinedSplit(test_fold=train_valid_indices)

# Define my own scoring function to see
# if it is called for both the training and the validation sets
from sklearn.metrics import make_scorer
custom_scorer = make_scorer(score_func=my_precision, greater_is_better=True, needs_proba=False)

# Instantiate xgboost
from xgboost.sklearn import XGBClassifier
classifier = XGBClassifier(random_state=0)

# Small parameter grid ONLY FOR A START;
# I plan to use much bigger parameter grids
parameters = {'n_estimators': [150, 175, 200]}

# Execute grid search and retrieve the best classifier
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=custom_scorer,
                                   cv=train_valid_merge_indices, refit=True, n_jobs=-1)
classifiers_grid.fit(X, y)

............................................................................

train_valid_merge - Specify my own validation set:

I want to train every model on my training set (data_train) and do the hyperparameter tuning with a distinct/separate validation set of mine (data_valid). For this reason I define a function called train_valid_merge which concatenates my training and validation sets so that they can be fed to GridSearchCV, and I use PredefinedSplit to specify which part of this merged set is the training set and which is the validation set:

def train_valid_merge(data_train, data_valid):

    # Set test_fold values to -1 for training observations
    train_indices = [-1]*len(data_train)

    # Set test_fold values to 0 for validation observations
    valid_indices = [0]*len(data_valid)

    # Concatenate the indices for the training and validation sets
    train_valid_indices = train_indices + valid_indices

    # Concatenate data_train & data_valid
    import pandas as pd
    data = pd.concat([data_train, data_valid], axis=0, ignore_index=True)
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    return X, y, train_valid_indices
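
To illustrate how PredefinedSplit interprets these test_fold values, here is a minimal sketch with toy sizes (not part of my actual code): entries set to -1 are always kept in the training part, and the entries set to 0 form the single validation fold.

from sklearn.model_selection import PredefinedSplit

# Toy test_fold: 5 "training" rows (-1) and 3 "validation" rows (0)
toy_test_fold = [-1] * 5 + [0] * 3
ps = PredefinedSplit(test_fold=toy_test_fold)

print(ps.get_n_splits())              # 1 -> a single train/validation split
for train_idx, valid_idx in ps.split():
    print(train_idx, valid_idx)       # [0 1 2 3 4] [5 6 7]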

............................................................................

custom_scorer - Specify my own scoring metric:

I define my own scoring function, which simply returns the precision, just to check whether it is called for both the training and the validation sets:

def my_precision(y_true, y_predict):

    # Check length of 'y_true' to see if it is the training or the validation set
    print(len(y_true))

    # Calculate precision
    from sklearn.metrics import precision_score
    precision = precision_score(y_true, y_predict, average='binary')

    return precision
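
Note that the object returned by make_scorer is a callable with the signature scorer(estimator, X, y); GridSearchCV calls it internally, but as a minimal sketch it can also be invoked directly, assuming a fitted classifier clf and validation arrays X_valid, y_true (hypothetical names here):

# A scorer built with make_scorer is called as scorer(estimator, X, y);
# 'clf', 'X_valid' and 'y_true' are assumed to be an already fitted
# classifier and the validation data/labels.
score_on_validation = custom_scorer(clf, X_valid, y_true)
# equivalent to: my_precision(y_true, clf.predict(X_valid))
print(score_on_validation)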

............................................................................

When I run the whole thing (for parameters = {'n_estimators': [150, 175, 200]}), the following is printed by the print(len(y_true)) call in the my_precision function:

600
600
3500
600
3500
3500

which means that the scoring function is called for both the training and the validation set. But I have tested that the scoring function is not only called, but that its results from both the training and the validation sets are used to determine the best model of the grid search (even though I have specified that only the validation set results should be used).

For example, with our 3 parameter values ('n_estimators': [150, 175, 200]), it takes into account the scores for both the training and the validation set (2 sets) and hence produces (3 parameters) x (2 sets) = 6 different grid results. So it picks the best hyperparameter set out of all of these grid results, and consequently it may end up picking one that comes from the training-set results, while I want it to take only the validation set into account (3 results).
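
One way to inspect which scores the grid search actually uses for the ranking is to look at cv_results_ after the fit; a small sketch, assuming classifiers_grid has been fitted as above:

import pandas as pd

# Inspect the grid search results after classifiers_grid.fit(X, y)
results = pd.DataFrame(classifiers_grid.cv_results_)

# 'rank_test_score' (and best_index_/best_params_) is derived from the test
# (here: validation) scores; train scores appear only if return_train_score=True
print(results[['params', 'mean_test_score', 'rank_test_score']])
print(classifiers_grid.best_index_, classifiers_grid.best_params_)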

As a workaround, if I add something like the following to the my_precision function to sideline the training set (by setting all its precision values to 0):

# Remember that the training set has 3500 observations
# and the validation set 600 observations
if len(y_true) > 600:
    return 0

then (as far as I have tested) I certainly get the best model according to my specifications, because the training set precision results are all 0 and therefore too small to matter.

My questions are the following:

Why is the custom scoring function taking both the training and the validation set into account when picking the best model, even though I have specified with my train_valid_merge_indices that the best model of the grid search should be selected according to the validation set only?

How can I make GridSearchCV account only for the validation set, and the models' scores on it, when the models are ranked and selected?


Solution

  • I have one distinct training set and one distinct validation set. I want to train my model on the training set and find the best hyperparameters based on its performance on my distinct validation set.

    Then you most certainly need neither PredefinedSplit nor GridSearchCV:

    import pandas as pd
    from xgboost.sklearn import XGBClassifier
    from sklearn.metrics import precision_score
    
    # Import datasets from edge node
    data_train = pd.read_csv('data.csv')
    data_valid = pd.read_csv('data_valid.csv')
    
    # training data & labels:
    X = data_train.iloc[:, :-1].values
    y = data_train.iloc[:, -1].values   
    
    # validation data & labels:
    X_valid = data_valid.iloc[:, :-1].values
    y_true = data_valid.iloc[:, -1].values 
    
    n_estimators = [150, 175, 200]
    perf = []
    
    for k_estimators in n_estimators:
        clf = XGBClassifier(n_estimators=k_estimators, random_state=0)
        clf.fit(X, y)
    
        y_predict = clf.predict(X_valid)
        precision = precision_score(y_true, y_predict, average='binary')
        perf.append(precision)
    

    and perf will contain the performance of your respective classifiers on your validation set...
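
    As a small follow-up sketch (not part of the original answer), you could then pick the n_estimators value with the best validation precision from perf and refit on the training set with it:

    import numpy as np

    # Index of the best validation precision in 'perf'
    best_idx = int(np.argmax(perf))
    best_n_estimators = n_estimators[best_idx]
    print(best_n_estimators, perf[best_idx])

    # Refit on the training set with the selected hyperparameter
    best_clf = XGBClassifier(n_estimators=best_n_estimators, random_state=0)
    best_clf.fit(X, y)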