Tags: python, python-3.x, machine-learning, scikit-learn, classification

The right way of using SMOTE in Classification Problems


What is the right way to use SMOTE() in a classification modeling process? I am confused about where it should be applied. Say I have split the dataset into train and test as a starter:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline

# Some dataset initialization
X = df.drop(['things'], axis = 1)
y = df['things']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# SMOTE() on the train dataset:
# (random_state belongs to the SMOTE constructor, not to fit_resample)
X_train_smote, y_train_smote = SMOTE(random_state = 42).fit_resample(X_train, y_train)

After applying SMOTE() to the training set above, my questions are:

  1. Should I apply SMOTE() inside the pipeline after splitting the dataset above, like this?
# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
                                ('over', SMOTE(random_state = 42)),
                                # liblinear supports both the 'l1' and 'l2' penalties tuned below
                                ('model', LogisticRegression(solver = 'liblinear', random_state = 42))])

# Then do model evaluation with Repeated Stratified KFold,
# Then do Grid Search for hyperparameter tuning
# Then do the actual model testing with unseen X_test (Like this): 

cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 42)

params = {'model__penalty': ['l1', 'l2'],
          'model__C':[0.001, 0.01, 0.1, 5, 10, 100]}
    
grid = GridSearchCV(estimator = pipeline,
                    param_grid = params,
                    scoring = 'roc_auc',
                    cv = cv,
                    n_jobs = -1)

grid.fit(X_train_smote, y_train_smote)
    

cv_score = grid.best_score_
test_score = grid.score(X_test, y_test)

print(f"Cross-validation score: {cv_score}\nTest score: {test_score}")
  2. Or, should I build the pipeline without SMOTE(), like this?
# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()), 
                                ('model', LogisticRegression(random_state = 42))])

# Same process as above for modeling, evaluation, etc... 
  3. Or, should I include SMOTE() in the pipeline but fit on the original, un-SMOTE'd training data, like this?
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps = [('scale', StandardScaler()),
                                ('over', SMOTE(random_state = 42)),
                                ('model', LogisticRegression(solver = 'liblinear', random_state = 42))])

# Same process as above for modeling, evaluation, etc... 

# BUT!, when fitting grid.fit(), we do this?:
grid.fit(X_train, y_train)
  4. Or, should I resample the training data with SMOTE() first and use a plain sklearn Pipeline, like this?
X_train_smote, y_train_smote = SMOTE(random_state = 42).fit_resample(X_train, y_train)

pipeline = Pipeline(steps = [('scale', StandardScaler()),
                             ('model', LogisticRegression(random_state = 42))])


# Same process as above for modeling, evaluation, etc... 

# BUT!, when fitting grid.fit(), we do this?:
grid.fit(X_train_smote, y_train_smote)


Solution

  • In general, you want to apply SMOTE to the training data but never to the validation or test data. So if you want to use k-fold cross-validation, you cannot SMOTE the data before sending it into that process; the resampling has to happen inside each fold.

    1. No, you are running SMOTE twice (once before and once inside the pipeline). You also end up with SMOTE'd points in the validation folds, which you don't want.
    2. No, you will have SMOTE'd points in the validation folds.
    3. Yes, this is the way to do it: SMOTE sits inside the pipeline, so each cross-validation fold resamples only its own training portion, and grid.fit() is called on the raw training data.
    4. No, you will have SMOTE'd points in the validation folds.
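    Putting option 3 together end to end, a minimal sketch of the correct workflow might look like this (the synthetic dataset, the smaller CV settings, and the parameter grid are illustrative assumptions, not from the question):

    ```python
    # Sketch of option 3: SMOTE lives inside the pipeline, and the grid search
    # is fit on the raw (un-resampled) training data.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as imbpipeline
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
    from sklearn.preprocessing import StandardScaler

    # Imbalanced toy problem: roughly 10% minority class
    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    pipeline = imbpipeline(steps=[('scale', StandardScaler()),
                                  ('over', SMOTE(random_state=42)),
                                  ('model', LogisticRegression(random_state=42))])

    # SMOTE is re-fit on each fold's training portion only;
    # the validation folds stay un-resampled
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    grid = GridSearchCV(estimator=pipeline,
                        param_grid={'model__C': [0.1, 1, 10]},
                        scoring='roc_auc',
                        cv=cv)
    grid.fit(X_train, y_train)  # raw training data, not X_train_smote

    test_auc = grid.score(X_test, y_test)
    print(f"Cross-validation AUC: {grid.best_score_:.3f}")
    print(f"Test AUC: {test_auc:.3f}")
    ```

    Because the resampler is a pipeline step, `grid.score(X_test, y_test)` skips SMOTE at prediction time automatically, so the test set is never resampled.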

    I recommend looking at sklearn.metrics.roc_auc_score() as well as whatever other metrics you use, because it can reveal issues caused by improperly splitting resampled data. (SMOTE'd points can be very predictable, but they do not improve the AUC.)
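
    To see why this matters, here is a sketch (synthetic data; all settings are illustrative assumptions) comparing the cross-validated AUC when SMOTE is applied before CV versus inside the pipeline. The first score is typically inflated, because interpolated minority points leak into the validation folds:

    ```python
    # Leaky vs. honest cross-validation with SMOTE on an imbalanced toy problem
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as imbpipeline
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # ~5% minority class, with some label noise so the task is not trivial
    X, y = make_classification(n_samples=600, weights=[0.95, 0.05],
                               flip_y=0.1, random_state=0)

    # Wrong: resample the whole dataset first, then cross-validate the result
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    leaky_auc = cross_val_score(LogisticRegression(max_iter=1000),
                                X_res, y_res, scoring='roc_auc', cv=5).mean()

    # Right: resample inside the pipeline, so only each fold's
    # training portion is ever SMOTE'd
    pipe = imbpipeline([('over', SMOTE(random_state=0)),
                        ('model', LogisticRegression(max_iter=1000))])
    honest_auc = cross_val_score(pipe, X, y, scoring='roc_auc', cv=5).mean()

    print(f"SMOTE before CV (leaky):  {leaky_auc:.3f}")
    print(f"SMOTE inside CV (honest): {honest_auc:.3f}")
    ```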