python, scikit-learn, cross-validation

Is there a way to define the fraction of each label I want in scikit-learn cross-validation?


I've written a simple Python script that uses sklearn.neural_network.MLPClassifier and sklearn.model_selection.GridSearchCV to make predictions on binary classification data, where each point is labelled either 0 or 1. In the training data, roughly 90% of the points have the label 1 and 10% have the label 0. In the test data, roughly 35% have the label 1 and 65% have the label 0. These test proportions are known, although the individual test labels are not.
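For reference, here is a stripped-down stand-in for my setup (the synthetic data and the parameter grid are just placeholders with the same ~90/10 imbalance, not my actual script):

```python
# Placeholder setup: synthetic data with ~90% label 1 / 10% label 0,
# plus a toy hyper-parameter grid, standing in for the real script.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(
    n_samples=2000, n_features=20, weights=[0.1, 0.9], random_state=0
)  # weights -> roughly 10% of points get label 0, 90% get label 1

param_grid = {"hidden_layer_sizes": [(50,), (100,)], "alpha": [1e-4, 1e-3]}

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    cv=5,  # default StratifiedKFold: every fold mirrors the 90/10 training ratio
)
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)
```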

My model is currently overfitting: my cross-validation score on the training data is 85-90%, but my score on the test set is below 40%.

One workaround I've thought of is to have GridSearchCV split the data so that each training/validation set has approximately the same proportion of labels as the test data. This doesn't seem to be an option in the library, however, and my google-fu hasn't turned up any other scikit-learn functionality that does it.
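Roughly, what I'm imagining is building my own (train, validation) index pairs whose validation parts are subsampled down to the 35/65 test proportions and passing them to GridSearchCV through cv. A sketch of that idea (reweighted_splits is just a name I made up, not an existing function):

```python
# Sketch of the workaround: hand GridSearchCV an explicit list of
# (train_idx, val_idx) splits in which each validation fold is subsampled
# to roughly 35% label 1 / 65% label 0, matching the test proportions.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def reweighted_splits(y, target_frac_label1=0.35, n_splits=5, seed=0):
    """Made-up helper: stratified folds whose validation parts are subsampled
    so that label 1 makes up ~target_frac_label1 of each validation fold."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    splits = []
    for train_idx, val_idx in skf.split(np.zeros((len(y), 1)), y):
        ones = val_idx[y[val_idx] == 1]
        zeros = val_idx[y[val_idx] == 0]
        # keep every label-0 point, then draw only enough label-1 points
        # to reach the desired 35/65 validation mix
        n_ones = int(round(len(zeros) * target_frac_label1 / (1 - target_frac_label1)))
        ones = rng.choice(ones, size=min(n_ones, len(ones)), replace=False)
        splits.append((train_idx, np.concatenate([zeros, ones])))
    return splits

# GridSearchCV accepts any iterable of (train_idx, val_idx) pairs via cv=, e.g.
# GridSearchCV(MLPClassifier(max_iter=500), param_grid,
#              cv=reweighted_splits(y_train))
```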

Are there any other libraries I could use, or a parameter I could input that I haven't managed to find? Thank you.


Solution

  • I would suggest the imblearn library, as it offers a wide variety of re-sampling methods. I do not know the size or other specifics of your data set, but in general I would argue that oversampling strategies should be favored over undersampling ones. You could, for example, use SMOTE to oversample your 0 labels in the training set; its sampling_strategy parameter also lets you specify the desired ratio beforehand (see the sketch below).
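A minimal sketch of that suggestion, using synthetic placeholder data with the question's ~90/10 imbalance (adjust sampling_strategy to the ratio you actually want):

```python
# Minimal SMOTE sketch on placeholder data with the question's ~90/10 imbalance.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X_train, y_train = make_classification(
    n_samples=2000, n_features=20, weights=[0.1, 0.9], random_state=0
)
print("before:", Counter(y_train))   # roughly {1: 1800, 0: 200}

# For binary data, a float sampling_strategy is the desired minority/majority
# ratio after resampling (0.5 -> one label-0 sample per two label-1 samples);
# a dict such as {0: 3300} sets an explicit per-class target count instead,
# which is how you could push label 0 up to ~65% of the training set.
smote = SMOTE(sampling_strategy=0.5, random_state=0)
X_res, y_res = smote.fit_resample(X_train, y_train)
print("after: ", Counter(y_res))     # roughly {1: 1800, 0: 900}
```

If you wire the resampling into GridSearchCV, imblearn.pipeline.Pipeline is worth a look: it applies the sampler only while fitting, so the validation folds are still scored on un-resampled data.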