Tags: python, scikit-learn, cross-validation, grid-search

Using Scikit-Learn GridSearchCV for cross validation with PredefinedSplit - Suspiciously good cross validation results


I'd like to use scikit-learn's GridSearchCV to perform a grid search and calculate the cross validation error using a predefined development and validation split (1-fold cross validation).

I'm afraid that I've done something wrong, because my validation accuracy is suspiciously high. Here is where I think I'm going wrong: my intent is to split my training data into development and validation sets, train on the development set, and record the cross validation score on the validation set. My accuracy might be inflated because I am really training on a mix of the development and validation sets and then testing on the validation set. I'm not sure I'm using scikit-learn's PredefinedSplit correctly. Details below:

Following this answer, I did the following:

    import numpy as np
    from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV

    # I split up my data into training and test sets. 
    X_train, X_test, y_train, y_test = train_test_split(
        data[training_features], data[training_response], test_size=0.2, random_state=550)

    # sanity check - dimensions of training and test splits
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)

    # dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
    # dimensions of X_test and y_test are (80858, 26) and (80858, 1)

    ''' Now, I define indices for a pre-defined split.
    This is a length-323430 array in which the entries for the development
    set are set to -1 and the entries for the validation set are set to 0.'''

    validation_idx = np.repeat(-1, y_train.shape[0])
    np.random.seed(550)
    validation_idx[np.random.choice(validation_idx.shape[0], 
           int(round(.2*validation_idx.shape[0])), replace = False)] = 0

    # Now, create a list which contains a single tuple of two elements, 
    # which are arrays containing the indices for the development and
    # validation sets, respectively.
    validation_split = list(PredefinedSplit(validation_idx).split())

    # sanity check
    print(len(validation_split[0][0])) # outputs 258744 
    print(len(validation_split[0][0]) / float(validation_idx.shape[0])) # outputs 0.8
    print(validation_idx.shape[0] == y_train.shape[0]) # True
    print(set(validation_split[0][0]).intersection(set(validation_split[0][1]))) # set([]) 
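
As one more sanity check on the -1/0 convention itself, here is a toy example reflecting my understanding of PredefinedSplit: rows labeled -1 always stay in the training (development) fold, and each non-negative label defines one test (validation) fold:

    # toy check of the -1/0 convention: -1 rows are never placed in a test
    # fold; the rows labeled 0 form the single test (validation) fold
    toy_fold = np.array([-1, -1, 0, -1, 0])
    for dev, val in PredefinedSplit(toy_fold).split():
        print(dev, val)  # prints: [0 1 3] [2 4]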

Now, I run a grid search using GridSearchCV. My intention is that a model will be fit on the development set for each parameter combination over the grid, and the cross validation score will be recorded when the resulting estimator is applied to the validation set.

    # a vanilla XGBoost model
    from xgboost import XGBClassifier

    model1 = XGBClassifier()

    # create a parameter grid for the number of trees and depth of trees
    n_estimators = range(300, 1100, 100)
    max_depth = [8, 10]
    param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

    # A grid search.
    # NOTE: cv accepts an iterable of (train, test) index pairs, so I pass
    # the list of splits generated from my PredefinedSplit object.
    grid_search = GridSearchCV(model1, param_grid,
           scoring='neg_log_loss',
           n_jobs=-1, 
           cv=validation_split,
           verbose=1)
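
Then I fit the grid search on the full training set (the snippets below use the result, grid_result2); the predefined split routes each row to the development or validation side:

    # fit on all of X_train; rows marked -1 in validation_idx are used for
    # training, rows marked 0 for scoring
    grid_result2 = grid_search.fit(X_train, y_train)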

Now, here is where a red flag is raised for me. I use the best estimator found by the grid search to compute the accuracy on the validation set. It's very high: 0.89207865689639176. What's worse, it's almost identical to the accuracy I get when I apply the classifier to the development set (on which I just trained): 0.89295597192591902. BUT, when I apply the classifier to the true test set, I get a much lower accuracy, roughly 0.78:

    from sklearn.metrics import accuracy_score

    # accuracy score on the validation set. This yields .89207865
    accuracy_score(y_pred = 
           grid_result2.predict(X_train.iloc[validation_split[0][1]]),
           y_true=y_train.iloc[validation_split[0][1]])

    # accuracy score when applied to the development set. This yields .8929559
    accuracy_score(y_pred = 
           grid_result2.predict(X_train.iloc[validation_split[0][0]]),
           y_true=y_train.iloc[validation_split[0][0]])

    # finally, the score when applied to the test set. This yields .783 
    accuracy_score(y_pred = grid_result2.predict(X_test), y_true = y_test)

To me, the near-exact agreement between the model's accuracy on the development and validation sets, together with the significant drop in accuracy on the test set, is a clear sign that I'm accidentally training on the validation data, and thus my cross validation score is not representative of the model's true accuracy.

I can't seem to find where I went wrong - mostly because I don't know what GridSearchCV is doing under the hood when it receives a predefined split as the argument to the cv parameter.
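
For what it's worth, here is my mental model of what I *expected* to happen for each grid point (a rough sketch of my assumption, not the actual scikit-learn internals):

    # a rough sketch of my assumption about the search loop -- NOT the
    # actual scikit-learn implementation
    from itertools import product
    from sklearn.base import clone

    dev_idx, val_idx = validation_split[0]
    for depth, n_est in product(max_depth, n_estimators):
        candidate = clone(model1).set_params(max_depth=depth,
                                             n_estimators=n_est)
        # train on the development rows only...
        candidate.fit(X_train.iloc[dev_idx], y_train.iloc[dev_idx])
        # ...and record the score on the held-out validation rows
        # (GridSearchCV would apply the neg_log_loss scorer here)
        score = candidate.score(X_train.iloc[val_idx], y_train.iloc[val_idx])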

Any ideas where I went wrong? If you need more details/elaboration, please let me know. The code is also in this notebook on github.

Thanks!


Solution

  • You need to set refit=False (refit=True is the default); otherwise, after the search completes, the grid search refits the best estimator on the whole dataset passed to fit, ignoring cv. That refit estimator, trained on the development and validation rows together, is the one predict uses, which is why your validation accuracy looks like training accuracy.
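
A minimal sketch of the fix, reusing the variables from the question; the manual refit on the development rows is one way to get a final model without ever training on the validation rows:

    # same search, but with the automatic whole-dataset refit disabled
    grid_search = GridSearchCV(model1, param_grid,
           scoring='neg_log_loss',
           n_jobs=-1,
           cv=validation_split,
           refit=False,
           verbose=1)
    grid_result2 = grid_search.fit(X_train, y_train)

    # with refit=False there is no best_estimator_, so refit manually on
    # the development rows only, using the best parameters found
    dev_idx, val_idx = validation_split[0]
    best_model = XGBClassifier(**grid_result2.best_params_)
    best_model.fit(X_train.iloc[dev_idx], y_train.iloc[dev_idx])

    # the validation score now reflects truly held-out data
    accuracy_score(y_pred=best_model.predict(X_train.iloc[val_idx]),
                   y_true=y_train.iloc[val_idx])

With this setup the validation accuracy should fall back in line with the test-set accuracy, since the validation rows are never seen during training.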