Undersampling before or after Train/Test Split

I have a credit card dataset in which 98% of the transactions are non-fraud and 2% are fraud.

I have been undersampling the majority class before the train/test split, and I get very good recall and precision on the test set.

When I undersample only the training set and test on an independent set, I get very poor precision but the same recall!

My question is:

Should I undersample before splitting into train and test, or will this distort the distribution of the dataset so that it is no longer representative of the real world?

Or does the above logic only apply when oversampling?

Thank you


  • If you have a chance to collect more data, that could be the best solution. (Assuming that you already attempted this step)

    Poor precision with good recall indicates that your model is good at predicting the fraud class as fraud but is confused by the non-fraud class: it frequently predicts non-fraud transactions as fraud (assuming 0 for the majority class and 1 for the minority class). This means you should try a less aggressive undersampling rate for the majority class.

    Typically, undersampling/oversampling is done on the train split only; this is the correct approach. However,

    1. Before undersampling, make sure your train split has the same class distribution as the main dataset. (Use a stratified split.)

    2. If you are using the Python sklearn library to train your classifier, set the parameter class_weight='balanced'.

    For example:

       from sklearn.linear_model import LogisticRegression
       Lr = LogisticRegression(class_weight='balanced')
    3. Try different algorithms with different hyperparameters; if the model is underfitting, consider XGBoost.
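
    The steps above could be put together roughly as follows. This is only a sketch under stated assumptions: the data is synthetic (generated with make_classification as a stand-in for the credit card dataset), the 5:1 non-fraud-to-fraud ratio kept in the training set is an arbitrary illustrative choice, and the undersampling is done by hand with NumPy rather than with any particular resampling library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 98% non-fraud (0), 2% fraud (1).
X, y = make_classification(
    n_samples=20000, n_features=10, weights=[0.98, 0.02], random_state=0
)

# Step 1: stratified split, so train and test keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Undersample the majority class in the TRAINING split only.
rng = np.random.default_rng(0)
fraud_idx = np.flatnonzero(y_train == 1)
nonfraud_idx = np.flatnonzero(y_train == 0)
ratio = 5  # assumed: keep 5 non-fraud rows per fraud row (tune this)
n_keep = min(len(nonfraud_idx), ratio * len(fraud_idx))
kept_nonfraud = rng.choice(nonfraud_idx, size=n_keep, replace=False)
train_idx = np.concatenate([fraud_idx, kept_nonfraud])
rng.shuffle(train_idx)
X_res, y_res = X_train[train_idx], y_train[train_idx]

# Step 2: class_weight='balanced' reweights the classes that remain.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_res, y_res)

# Evaluate on the untouched test split, which still has the original imbalance.
y_pred = clf.predict(X_test)
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall:   ", recall_score(y_test, y_pred))
```

    Raising `ratio` keeps more non-fraud rows in training, which is one way to act on the advice to undersample less aggressively when precision is poor.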

    If you do undersample before splitting then the test split distribution may not replicate the distribution of real-world data. Hence people typically avoid sampling before splitting.
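
    A small calculation may help show why recall stays the same while precision drops on the untouched test set. The recall and false positive rate below are made-up numbers, and the sketch assumes the classifier's per-class error rates are identical on both test sets; only the share of fraud in the test set changes.

```python
def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Precision implied by fixed per-class rates at a given fraud share."""
    true_positives = recall * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Hypothetical classifier: catches 90% of fraud, flags 5% of non-fraud.
recall, fpr = 0.90, 0.05

# Test set undersampled to 50% fraud versus the real 2% fraud.
balanced = precision_at_prevalence(recall, fpr, 0.50)
realistic = precision_at_prevalence(recall, fpr, 0.02)

print(f"precision on a 50/50 test set: {balanced:.3f}")   # 0.947
print(f"precision on a 98/2 test set:  {realistic:.3f}")  # 0.269
```

    Recall depends only on the fraud rows, so it is unaffected by how many non-fraud rows the test set contains. Precision also counts false positives, and a realistic test set has far more non-fraud rows to produce them.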