Tags: python, validation, machine-learning, scikit-learn, grid-search

Python, machine learning - Perform a grid search on custom validation set


I am dealing with an unbalanced classification problem, where my negative class is 1000 times more numerous than my positive class. My strategy is to train a deep neural network on a balanced (50/50 ratio) training set (I have enough simulated samples), and then use an unbalanced (1/1000 ratio) validation set to select the best model and optimise the hyperparameters.

Since the number of parameters is significant, I want to use scikit-learn RandomizedSearchCV, i.e. a random grid search.

To my understanding, scikit-learn's GridSearchCV evaluates candidate hyperparameters by cross-validation on the training data. In my case, however, this means the search would select the model that performs best on balanced folds, not on more realistic unbalanced data.

My question is: is there a way to run the grid search with performance estimated on a specific, user-defined validation set?


Solution

  • As suggested in the comments, what you need is PredefinedSplit (it is also described in a related question).

    To see how it works, look at the example from the documentation:

    import numpy as np
    from sklearn.model_selection import PredefinedSplit
    
    X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    y = np.array([0, 0, 1, 1])
    
    # This is what you need: one fold label per sample
    test_fold = [0, 1, -1, 1]
    
    ps = PredefinedSplit(test_fold)
    print(ps.get_n_splits())
    # OUTPUT: 2
    
    for train_index, test_index in ps.split():
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    
    # OUTPUT:
    # TRAIN: [1 2 3] TEST: [0]
    # TRAIN: [0 2] TEST: [1 3]
    

    As you can see, you pass test_fold a list with one entry per sample: each entry is the index of the test fold that sample belongs to, and -1 marks samples that are never included in any validation set.

    So in the code above, test_fold = [0, 1, -1, 1] means the 1st validation set contains the samples whose entry is 0 (i.e. index 0), and the 2nd contains the samples whose entry is 1 (indices 1 and 3). The sample at index 2, with entry -1, stays on the training side of every split.
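To make the role of -1 concrete, here is a minimal sketch with made-up data: the first two samples are marked -1, so they appear in the training side of the (single) split and never in a test fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# made-up data: 4 samples, 2 features
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])

# -1 -> never in a test fold; 0 -> member of test fold 0
test_fold = [-1, -1, 0, 0]
ps = PredefinedSplit(test_fold)

print(ps.get_n_splits())   # 1 (only fold 0 exists)
for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    # TRAIN: [0 1] TEST: [2 3]
```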

    So, since you already have X_train and X_test, if you want your validation set to come only from X_test, you need to do the following:

    import numpy as np
    from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV
    
    # -1 for every sample of X_train, so they always stay in the training set
    my_test_fold = [-1] * len(X_train)
    
    # 0 for every sample of X_test, so they form the single validation fold
    my_test_fold += [0] * len(X_test)
    
    clf = RandomizedSearchCV(..., cv=PredefinedSplit(test_fold=my_test_fold))
    
    # Combine X_train and X_test (and y_train, y_test) in the same order as my_test_fold
    clf.fit(np.concatenate((X_train, X_test), axis=0),
            np.concatenate((y_train, y_test), axis=0))
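Putting it all together, here is an end-to-end sketch on synthetic data. The data, LogisticRegression (a stand-in for the question's deep network), the parameter range, and the average_precision scorer (a reasonable choice for heavily unbalanced classes) are all illustrative assumptions; the one PredefinedSplit fold is the unbalanced validation set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

rng = np.random.RandomState(0)

# synthetic stand-ins: balanced training set, unbalanced validation set
X_train = rng.randn(100, 5)
y_train = np.array([0, 1] * 50)
X_val = rng.randn(60, 5)
y_val = np.array([0] * 55 + [1] * 5)

# -1 -> always train; 0 -> the single predefined validation fold
my_test_fold = [-1] * len(X_train) + [0] * len(X_val)

search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),      # stand-in for the DNN
    param_distributions={"C": np.logspace(-3, 3, 20)},
    n_iter=5,
    scoring="average_precision",                 # imbalance-aware metric
    cv=PredefinedSplit(test_fold=my_test_fold),
    random_state=0,
)
search.fit(np.concatenate((X_train, X_val)),
           np.concatenate((y_train, y_val)))
print(search.best_params_)
```

Every candidate is trained on the balanced block and scored only on the unbalanced fold, which is exactly the selection criterion the question asks for.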