scikit-learn training-data

Applying undersampling techniques to train and test data


I know that when you apply a transformation, you call fit() on the training set and then have to transform() both the training set and the test set.

Suppose you apply a targeted undersampling technique such as TomekLinks to your training data to help the model better identify and separate the classes.

  • Question: If you are going to use the model to predict on a test set, do you also apply the same undersampling technique to the test set? Or is undersampling used only on the training set, to help the model clarify class boundaries, with the trained model then applied to the full test set?

Solution

  • I don't think you should undersample your test data. While it is perfectly reasonable to do it to the training data, doing it to the test data is unrealistic, because the test set should reflect the class distribution the model will actually encounter. If the model is intended for any online application, it needs to be tested on the real, unbalanced dataset.
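To make the train-only idea concrete, here is a minimal sketch in plain Python. It hand-rolls a simplified Tomek-link removal rather than using a library, so the function names (`tomek_link_indices`, `undersample_train`, `predict_1nn`) and the toy data are assumptions made for illustration only. In practice you would more likely call imbalanced-learn's `TomekLinks().fit_resample(X_train, y_train)` and leave `X_test` alone in the same way.

```python
import math

def tomek_link_indices(X, y, majority_label):
    """Indices of majority-class points that form a Tomek link.

    A Tomek link is a pair of points from different classes that are
    each other's nearest neighbour.
    """
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    to_drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:
            # Mutual nearest neighbours with different labels:
            # drop only the majority-class member of the pair.
            to_drop.add(i if y[i] == majority_label else j)
    return to_drop

def undersample_train(X, y, majority_label):
    """Return the training data with Tomek-linked majority points removed."""
    drop = tomek_link_indices(X, y, majority_label)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def predict_1nn(X_fit, y_fit, X_new):
    """Toy 1-nearest-neighbour classifier, standing in for a real model."""
    return [y_fit[min(range(len(X_fit)),
                      key=lambda j: math.dist(x, X_fit[j]))]
            for x in X_new]

# Hypothetical, imbalanced toy data (class 0 is the majority).
X_train = [(0.0,), (1.0,), (2.2,), (2.9,), (3.0,), (5.0,)]
y_train = [0, 0, 0, 0, 1, 1]
X_test = [(0.4,), (2.5,), (4.2,)]
y_test = [0, 0, 1]

# Undersample the TRAINING data only.
X_res, y_res = undersample_train(X_train, y_train, majority_label=0)

# Predict on the full, untouched test set.
y_pred = predict_1nn(X_res, y_res, X_test)
```

The point of the sketch is the asymmetry: resampling touches `X_train` and `y_train`, while `X_test` and `y_test` keep their original size and class balance, so any metric computed on them describes performance on realistic data.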