Tags: machine-learning, svm, cross-validation, training-data

Perform cross-validation on training or validation partition to tune parameters


I have a large dataset which is partitioned into three chunks (train-validate-test), and I want to perform cross-validation (CV). Since the dataset is large, it would take too long to run CV on the whole thing. Which partition is the right one to perform CV on? I've seen tutorials that use only the training split, others that use only the validation split, and others that use the entire dataset.

Thank you for any clarification or help.


Solution

  • To simplify things, let's assume that you only have one hyper-parameter. If you want to do cross-validation, you would choose N different values of the hyper-parameter and train N different models on the training set. You would then choose the hyper-parameter value that had the best performance on the validation set. Then you would retrain the model on both the training and validation sets using the selected hyper-parameter. The model's performance is finally evaluated on the test set.
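
    For concreteness, here is a minimal sketch of that manual workflow, assuming an SVC classifier with a single hyper-parameter C, and that the three splits already exist as NumPy arrays under the placeholder names X_train/Y_train, X_val/Y_val and X_test/Y_test:

    from sklearn.svm import SVC
    import numpy as np

    candidate_Cs = [0.1, 1, 5, 15, 100]

    # Train one model per candidate value and score it on the validation set
    val_scores = {}
    for C in candidate_Cs:
        model = SVC(C=C).fit(X_train, Y_train)
        val_scores[C] = model.score(X_val, Y_val)

    # Keep the C with the best validation score
    best_C = max(val_scores, key=val_scores.get)

    # Retrain on train + validation using the selected hyper-parameter
    X_trainval = np.concatenate([X_train, X_val])
    Y_trainval = np.concatenate([Y_train, Y_val])
    final_model = SVC(C=best_C).fit(X_trainval, Y_trainval)

    # Final, unbiased performance estimate on the untouched test set
    test_score = final_model.score(X_test, Y_test)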

    If your dataset is huge, you could select a small subset, find the optimal hyper-parameters, and keep increasing the subset size until you can infer what the optimal hyper-parameters would be at the full dataset size. In practice you can often get away with selecting as large a subset as you can be bothered with and just using the optimal hyper-parameters found for that subset.
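
    A rough sketch of that idea (the subset sizes and candidate values below are purely illustrative), reusing the validation split from above and checking whether the selected hyper-parameter stabilises as the subset grows:

    from sklearn.svm import SVC
    from sklearn.utils import resample

    candidate_Cs = [0.1, 1, 5, 15, 100]

    for n in [1_000, 5_000, 20_000]:  # growing subset sizes (illustrative)
        # Draw a random subset of the training data without replacement
        X_sub, Y_sub = resample(X_train, Y_train, n_samples=n,
                                replace=False, random_state=0)
        best_C = max(candidate_Cs,
                     key=lambda C: SVC(C=C).fit(X_sub, Y_sub).score(X_val, Y_val))
        print(n, best_C)
        # If best_C stops changing as n grows, it is probably a reasonable
        # choice for the full dataset as well.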

    EDIT:

    If you use scikit-learn, here's the pseudo code for a hypothetical model with a hyper-parameter C:

    from sklearn.model_selection import GridSearchCV
    
    # X_train, X_test are the train and test features
    # Y_train, Y_test are the corresponding labels/values to predict.
    # model is some scikit-learn regression or classification model
    
    # Create a parameter grid
    param_grid = {'C': [0.1, 1, 5, 15, 100]}
    
    # Do two fold CV. You can do other types of CV as well by passing
    # a cross-validation generator
    estimator = GridSearchCV(model, cv=2, param_grid=param_grid)
    # Do the cross validation procedure explained below
    estimator.fit(X_train, Y_train)
    

    What happens when you run the fit method is that your training set (X_train, Y_train) is split into two halves. You then train the model with C=0.1 on the first half and score its performance on the second half. In this case the first half is the training set and the second half is the validation set. Afterwards you repeat the procedure, but with the second half as the training set and the first half as the validation set. The two performance scores are then averaged and stored.
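
    Conceptually, that fit call is doing something like the following hand-rolled loop (a rough sketch, not the actual library internals, and assuming X_train and Y_train are NumPy arrays so they can be indexed by fold):

    from sklearn.base import clone
    from sklearn.model_selection import KFold
    import numpy as np

    cv_scores = {}
    for C in param_grid['C']:
        fold_scores = []
        # Two folds: each half takes a turn as the validation set
        for train_idx, val_idx in KFold(n_splits=2).split(X_train):
            m = clone(model).set_params(C=C)
            m.fit(X_train[train_idx], Y_train[train_idx])
            fold_scores.append(m.score(X_train[val_idx], Y_train[val_idx]))
        cv_scores[C] = np.mean(fold_scores)  # average the two fold scores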

    You then repeat this procedure for the remaining values of C and check which value of C gives the best prediction accuracy. That value is then used to train a final model on the entire training set (X_train, Y_train). The model's performance can then be evaluated on the held-out test set with

    score = estimator.score(X_test, Y_test)
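
    Besides the test score, the fitted GridSearchCV object exposes which value of C was selected and how well it did during cross-validation. Note that, with the default refit=True, GridSearchCV already performs the final retraining on the entire training set for you:

    print(estimator.best_params_)  # the selected value of C
    print(estimator.best_score_)   # its mean cross-validated score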