I've written a simple Python script that uses sklearn.neural_network.MLPClassifier and sklearn.model_selection.GridSearchCV to make predictions on binary classification data, where each point is labelled either 0 or 1. In the training data, roughly 90% of points have the label 1 and 10% the label 0. In the test data, roughly 35% have the label 1 and 65% the label 0. These proportions are known, although the individual labels are not.
My model is currently over-fitting: my cross-validation score on the training data is 85-90%, but the score on the test set is below 40%.
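For reference, a minimal version of my setup looks something like this (the data and the parameter grid here are simplified stand-ins, not my actual values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the training data: ~90% label 1, ~10% label 0.
X_train, y_train = make_classification(
    n_samples=300, weights=[0.1, 0.9], random_state=0
)

# Placeholder grid; the real one searches over more hyperparameters.
param_grid = {"alpha": [1e-4, 1e-2]}

search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=200, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_score_)  # mean cross-validation accuracy of the best grid point
```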
One workaround I've thought of is setting GridSearchCV to split the data so that each training/validation set has approximately the same proportion of labels as the test data. This doesn't seem to be an option in the library, however, and my google-fu hasn't turned up any other scikit-learn functionality that does it.
Are there any other libraries I could use, or a parameter I could input that I haven't managed to find? Thank you.
I would suggest the imblearn library, as it offers a wide variety of methods for re-sampling. I don't know the size or other specifics of your data set, but in general I would argue that oversampling strategies should be favoured over undersampling ones. You could, for example, use SMOTE to oversample your 0 labels in the training set. Its sampling_strategy parameter also lets you specify the desired class ratio beforehand.