Tags: python, validation, machine-learning, scikit-learn, grid-search

Python, machine learning - Perform a grid search on custom validation set


I am dealing with an unbalanced classification problem, where my negative class is 1000 times more numerous than my positive class. My strategy is to train a deep neural network on a balanced (50/50 ratio) training set (I have enough simulated samples), and then use an unbalanced (1/1000 ratio) validation set to select the best model and optimise the hyperparameters.

Since the number of parameters is significant, I want to use scikit-learn RandomizedSearchCV, i.e. a random grid search.

To my understanding, scikit-learn's GridSearchCV evaluates candidate hyperparameters by cross-validation on the training data. In my case, however, this means the search would select the model that performs best on balanced folds, not on more realistic unbalanced data.

My question is: is there a way to run the grid search with performance estimated on a specific, user-defined validation set?


Solution

  • As suggested in the comments, what you need is PredefinedSplit (it is also described in a related question).

    To see how it works, look at the example from the documentation:

    import numpy as np
    from sklearn.model_selection import PredefinedSplit
    
    X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    y = np.array([0, 0, 1, 1])
    
    # This is what you need: one fold label per sample
    test_fold = [0, 1, -1, 1]
    
    ps = PredefinedSplit(test_fold)
    print(ps.get_n_splits())
    # OUTPUT: 2
    
    for train_index, test_index in ps.split():
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    
    # OUTPUT:
    # TRAIN: [1 2 3] TEST: [0]
    # TRAIN: [0 2] TEST: [1 3]
    

    As you can see, you pass test_fold a list with one entry per sample: each entry is the index of the test fold that sample belongs to, and -1 marks samples that are never included in any validation set.

    So in the code above, test_fold = [0, 1, -1, 1] means the 1st validation set contains the samples whose entry is 0 (i.e. index 0), and the 2nd contains the samples whose entry is 1 (indices 1 and 3). The sample at index 2, with entry -1, stays on the training side of every split.
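To make the role of -1 concrete, here is a minimal sketch with made-up data: the first two samples are marked -1, so they appear in the training side of the (single) split and never in a test fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# made-up data: 4 samples, 2 features
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])

# -1 -> never in a test fold; 0 -> member of test fold 0
test_fold = [-1, -1, 0, 0]
ps = PredefinedSplit(test_fold)

print(ps.get_n_splits())   # 1 (only fold 0 exists)
for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    # TRAIN: [0 1] TEST: [2 3]
```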

    So, since you already have X_train and X_test, if you want your validation set to come only from X_test, you need to do the following:

    import numpy as np
    from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV
    
    # -1 for every sample of X_train, so they always stay in the training set
    my_test_fold = [-1] * len(X_train)
    
    # 0 for every sample of X_test, so they form the single validation fold
    my_test_fold += [0] * len(X_test)
    
    clf = RandomizedSearchCV(..., cv=PredefinedSplit(test_fold=my_test_fold))
    
    # Combine X_train and X_test (and y_train, y_test) in the same order as my_test_fold
    clf.fit(np.concatenate((X_train, X_test), axis=0),
            np.concatenate((y_train, y_test), axis=0))
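Putting it all together, here is an end-to-end sketch on synthetic data. The data, LogisticRegression (a stand-in for the question's deep network), the parameter range, and the average_precision scorer (a reasonable choice for heavily unbalanced classes) are all illustrative assumptions; the one PredefinedSplit fold is the unbalanced validation set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

rng = np.random.RandomState(0)

# synthetic stand-ins: balanced training set, unbalanced validation set
X_train = rng.randn(100, 5)
y_train = np.array([0, 1] * 50)
X_val = rng.randn(60, 5)
y_val = np.array([0] * 55 + [1] * 5)

# -1 -> always train; 0 -> the single predefined validation fold
my_test_fold = [-1] * len(X_train) + [0] * len(X_val)

search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),      # stand-in for the DNN
    param_distributions={"C": np.logspace(-3, 3, 20)},
    n_iter=5,
    scoring="average_precision",                 # imbalance-aware metric
    cv=PredefinedSplit(test_fold=my_test_fold),
    random_state=0,
)
search.fit(np.concatenate((X_train, X_val)),
           np.concatenate((y_train, y_val)))
print(search.best_params_)
```

Every candidate is trained on the balanced block and scored only on the unbalanced fold, which is exactly the selection criterion the question asks for.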