Tags: machine-learning, scikit-learn, cross-validation, gridsearchcv, k-fold

Do you predict on test data after cross-validation (GridSearchCV with KFold), and if so, how?


Background:

I am working on a project involving a multi-class classification problem using scikit-learn. My dataset contains 112 feature vectors for each of 40 measured objects (MO): 4480 feature vectors in total, equally divided into 4 classes, with 533 features each. (More information on the data set here)

Approach:

After splitting the dataset (train: 34 MO, test: 6 MO) and reducing the number of features, mostly through PCA, I tuned the hyperparameters with GridSearchCV using KFold for different models in order to compare them.
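Roughly, the tuning step looks like the sketch below (the SVC estimator, the parameter grid, and the random placeholder data are illustrative assumptions, not my exact setup):

    # Minimal sketch: PCA-based feature reduction inside a Pipeline,
    # tuned with GridSearchCV + KFold on the training portion only.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, KFold

    # Placeholder training data: 34 MO x 112 vectors = 3808 samples, 533 features
    X_train = np.random.rand(3808, 533)
    y_train = np.random.randint(0, 4, 3808)

    pipe = Pipeline([
        ("pca", PCA(n_components=50)),        # feature reduction
        ("clf", SVC()),                       # one of the models under comparison
    ])

    param_grid = {
        "pca__n_components": [30, 50, 100],
        "clf__C": [0.1, 1, 10],
    }

    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)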

Questions:

  1. When evaluating, is the split into train and test data necessary? My professor says no, cross-validation by itself makes it obsolete. This contradicts my basic understanding of machine-learning best practices and the cross-validation documentation of sklearn.
  2. Do I have to account for the feature space of each MO in the test set when predicting/evaluating, and if yes, how would I do that? E.g. run a cross-validation-style prediction on the test data, or just predict on the test data as a whole?

Solution

  • The comment by @4.Pi.n solved my problem:

    1. It's exactly as your professor says,
    2. The most common way is to store the k models and then average their predictions, e.g. y_pred = (pred_1 + pred_2 + ... + pred_k) / k, or you might use sklearn.model_selection.cross_val_predict (see the sketch below).
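For illustration, here is a minimal sketch of both options (the placeholder data, the LogisticRegression stand-in for the tuned pipeline, and the fold count are my assumptions, not part of the original comment). Note that for class labels the averaging is usually done on predicted probabilities (soft voting) rather than on the labels themselves:

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_predict

    # Placeholder data; best_pipe stands in for the tuned estimator
    # returned by GridSearchCV (search.best_estimator_).
    X_train = np.random.rand(3808, 533)
    y_train = np.random.randint(0, 4, 3808)
    X_test = np.random.rand(672, 533)         # 6 MO x 112 vectors
    best_pipe = LogisticRegression(max_iter=1000)

    kf = KFold(n_splits=5, shuffle=True, random_state=0)

    # Option A: keep the k fold-models and average their predicted
    # probabilities on the held-out test set (soft vote).
    fold_probas = []
    for train_idx, _ in kf.split(X_train):
        model = clone(best_pipe).fit(X_train[train_idx], y_train[train_idx])
        fold_probas.append(model.predict_proba(X_test))
    y_pred_test = np.mean(fold_probas, axis=0).argmax(axis=1)

    # Option B: cross_val_predict returns an out-of-fold prediction for
    # every training sample (each sample is predicted by the model that
    # did not see it during fitting).
    y_pred_oof = cross_val_predict(best_pipe, X_train, y_train, cv=kf)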