
How to use cross-validation split data with RandomizedSearchCV


I'm trying to move my model from a single run to hyper-parameter tuning with RandomizedSearchCV.

In my single-run case, my data is split into train/validation/test sets.
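
Roughly, my single-run split looks like the sketch below (the toy data and split ratios are just placeholders, not my exact setup):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # toy placeholders for my inputs and labels
    X = np.random.rand(100, 10)
    y = np.random.randint(0, 2, size=100)

    # carve out a held-out test set, then split the remainder into train/validation
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, stratify=y_trainval)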

When I run RandomizedSearchCV on my train_data with the default 3-fold CV, I notice that the length of my train_input drops to about 66% of train_data (which makes sense for 3-fold CV...).

So I'm guessing that I should merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.

Would that be the right way to go?

My question is: how can I access the remaining 33% of my train_input to feed it to my validation-accuracy function (note that my score function runs on the test set)?

Thanks for your help! Yoann


Solution

  • I'm not sure that my code would help here since my question is rather generic.

    This is the answer I found by going through sklearn's code: RandomizedSearchCV does not expose the validation split in any convenient way, so I should indeed merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.

    The train_data is split for CV into train/validation sets by a cross-validator (in my case Stratified K-Folds, http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
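
    To illustrate what the cross-validator does (a minimal sketch with made-up toy data), it simply yields index arrays for each fold:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # toy data: 9 samples, 2 classes (placeholder values, not my real data)
    X = np.random.rand(9, 4)
    y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])

    skf = StratifiedKFold(n_splits=3)
    for train_idx, val_idx in skf.split(X, y):
        # ~66% of the rows form the fold's train set, ~33% its validation set
        print(len(train_idx), len(val_idx))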

    My estimator is defined as follows:

    from sklearn.base import BaseEstimator, ClassifierMixin

    class DNNClassifier(BaseEstimator, ClassifierMixin):
        ...  # the estimator's fit, predict and custom score live here

    It needs a score function to be able to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (which returns the mean accuracy and requires a predict function to be implemented in the estimator class).

    In my case, I implemented a custom score function within my estimator class.
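
    Roughly, the custom score hooks in like this (a sketch: accuracy_score is just one possible metric, and the rest of the class is elided):

    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.metrics import accuracy_score

    class DNNClassifier(BaseEstimator, ClassifierMixin):
        # __init__, fit and predict for the actual network go here

        def score(self, X, y):
            # called by RandomizedSearchCV on each fold's validation split
            return accuracy_score(y, self.predict(X))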

    The hyper-parameter search and the CV fits are done by calling the fit function of RandomizedSearchCV:

    RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
    

    This fit function runs the estimator's custom fit function on each fold's train split and then the score function on that fold's validation split.
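
    Put together, the call looks roughly like this (a sketch: the distributions, the n_hidden/learning_rate hyper-parameters and the train_labels array are illustrative, not my exact setup):

    from scipy.stats import randint, loguniform
    from sklearn.model_selection import RandomizedSearchCV

    # illustrative search space over hypothetical constructor arguments of DNNClassifier
    param_distribs = {
        "n_hidden": randint(32, 256),
        "learning_rate": loguniform(1e-4, 1e-1),
    }

    search = RandomizedSearchCV(DNNClassifier(), param_distribs, n_iter=20, cv=3)
    search.fit(train_data, train_labels)  # labels are needed for the stratified folds
    print(search.best_params_, search.best_score_)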

    Internally, this is done using the _fit_and_score function from the sklearn.model_selection._validation module.

    So I can only access the automatically split validation set (about 33% of my train_data) once my estimator's fit function has finished, i.e. in the score function.

    I'd have preferred to access it within my estimator's fit function so that I could use it to plot validation accuracy over the training steps and for early stopping (I'll keep a separate validation set for that).

    I guess I could reconstruct the automatically generated validation set by looking for the missing indices in my initial train_data (the train_data passed to the estimator's fit function contains about 66% of the indices of the initial train_data).
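
    One way to sidestep that reconstruction entirely (a sketch, with train_labels and param_distribs standing in for my actual arrays and search space) is to pre-compute the folds and pass them to RandomizedSearchCV through its cv parameter, so the validation indices are known up front:

    from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

    skf = StratifiedKFold(n_splits=3)
    folds = list(skf.split(train_data, train_labels))  # [(train_idx, val_idx), ...]

    search = RandomizedSearchCV(DNNClassifier(), param_distribs, cv=folds)
    search.fit(train_data, train_labels)

    # assuming train_data is a NumPy array, the first fold's validation rows are then:
    val_fold_0 = train_data[folds[0][1]]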

    If that is something that someone has already done I'd love to hear about it!