
How to use cross-validation split data with RandomizedSearchCV


I'm trying to move my model from a single run to hyper-parameter tuning with RandomizedSearchCV.

In my single-run case, my data is split into train/validation/test sets.
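
Roughly, my single-run split looks like the sketch below (the toy data and split ratios are just placeholders, not my exact setup):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # toy placeholders for my inputs and labels
    X = np.random.rand(100, 10)
    y = np.random.randint(0, 2, size=100)

    # carve out a held-out test set, then split the remainder into train/validation
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, stratify=y_trainval)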

When I run RandomizedSearchCV on my train_data with the default 3-fold CV, I notice that the length of my train_input drops to about 66% of train_data (which makes sense for 3-fold CV...).

So I'm guessing that I should merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.

Would that be the right way to go?

My question is: how can I access the remaining 33% of my train_input to feed it to my validation-accuracy function (note that my score function runs on the test set)?

Thanks for your help! Yoann


Solution

  • I'm not sure that my code would help here since my question is rather generic.

    This is the answer I found by going through sklearn's code: RandomizedSearchCV does not expose the validation split in any convenient way, so I should indeed merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.

    The train_data is split for CV into train/validation sets by a cross-validator (in my case Stratified K-Folds, http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
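
    To illustrate what the cross-validator does (a minimal sketch with made-up toy data), it simply yields index arrays for each fold:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # toy data: 9 samples, 2 classes (placeholder values, not my real data)
    X = np.random.rand(9, 4)
    y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])

    skf = StratifiedKFold(n_splits=3)
    for train_idx, val_idx in skf.split(X, y):
        # ~66% of the rows form the fold's train set, ~33% its validation set
        print(len(train_idx), len(val_idx))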

    My estimator is defined as follows:

    from sklearn.base import BaseEstimator, ClassifierMixin

    class DNNClassifier(BaseEstimator, ClassifierMixin):
        ...  # the estimator's fit, predict and custom score live here

    It needs a score function to be able to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (which returns the mean accuracy and requires a predict function to be implemented in the estimator class).

    In my case, I implemented a custom score function within my estimator class.
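
    Roughly, the custom score hooks in like this (a sketch: accuracy_score is just one possible metric, and the rest of the class is elided):

    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.metrics import accuracy_score

    class DNNClassifier(BaseEstimator, ClassifierMixin):
        # __init__, fit and predict for the actual network go here

        def score(self, X, y):
            # called by RandomizedSearchCV on each fold's validation split
            return accuracy_score(y, self.predict(X))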

    The hyper-parameter search and the CV fits are done by calling the fit function of RandomizedSearchCV:

    RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
    

    This fit function runs the estimator's custom fit function on each fold's train split and then the score function on that fold's validation split.
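
    Put together, the call looks roughly like this (a sketch: the distributions, the n_hidden/learning_rate hyper-parameters and the train_labels array are illustrative, not my exact setup):

    from scipy.stats import randint, loguniform
    from sklearn.model_selection import RandomizedSearchCV

    # illustrative search space over hypothetical constructor arguments of DNNClassifier
    param_distribs = {
        "n_hidden": randint(32, 256),
        "learning_rate": loguniform(1e-4, 1e-1),
    }

    search = RandomizedSearchCV(DNNClassifier(), param_distribs, n_iter=20, cv=3)
    search.fit(train_data, train_labels)  # labels are needed for the stratified folds
    print(search.best_params_, search.best_score_)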

    Internally, this is done using the _fit_and_score function from the sklearn.model_selection._validation module.

    So I can only access the automatically split validation set (about 33% of my train_data) once my estimator's fit function has finished, i.e. in the score function.

    I'd have preferred to access it within my estimator's fit function so that I could use it to plot validation accuracy over the training steps and for early stopping (I'll keep a separate validation set for that).

    I guess I could reconstruct the automatically generated validation set by looking for the missing indices in my initial train_data (the train_data passed to the estimator's fit function contains about 66% of the indices of the initial train_data).
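
    One way to sidestep that reconstruction entirely (a sketch, with train_labels and param_distribs standing in for my actual arrays and search space) is to pre-compute the folds and pass them to RandomizedSearchCV through its cv parameter, so the validation indices are known up front:

    from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

    skf = StratifiedKFold(n_splits=3)
    folds = list(skf.split(train_data, train_labels))  # [(train_idx, val_idx), ...]

    search = RandomizedSearchCV(DNNClassifier(), param_distribs, cv=folds)
    search.fit(train_data, train_labels)

    # assuming train_data is a NumPy array, the first fold's validation rows are then:
    val_fold_0 = train_data[folds[0][1]]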

    If that is something that someone has already done I'd love to hear about it!