What is the right way to apply SMOTE() in a classification modeling process? I am confused about where it should go. Say I start by splitting the dataset into train and test sets like this:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split

# Some dataset initialization
X = df.drop(['things'], axis=1)
y = df['things']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SMOTE() on the train dataset (random_state belongs in the SMOTE() constructor, not in fit_resample()):
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
After applying SMOTE() on the train dataset for the classification problem above, my questions are:

1. Do we put SMOTE() inside the pipeline after splitting the dataset above, like this?

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps=[('scale', StandardScaler()),
                              ('over', SMOTE(random_state=42)),
                              # liblinear supports both the 'l1' and 'l2' penalties used in the grid below
                              ('model', LogisticRegression(solver='liblinear', random_state=42))])
# Then do model evaluation with RepeatedStratifiedKFold,
# then do Grid Search for hyperparameter tuning,
# then do the actual model testing with the unseen X_test, like this:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
params = {'model__penalty': ['l1', 'l2'],
          'model__C': [0.001, 0.01, 0.1, 5, 10, 100]}
grid = GridSearchCV(estimator=pipeline,
                    param_grid=params,
                    scoring='roc_auc',
                    cv=cv,
                    n_jobs=-1)
grid.fit(X_train_smote, y_train_smote)
cv_score = grid.best_score_
test_score = grid.score(X_test, y_test)
print(f"Cross-validation score: {cv_score} \n Test score: {test_score}")
2. Or do we fit on the SMOTE'd data (X_train_smote, y_train_smote) and leave SMOTE() out of the pipeline, like this?

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps=[('scale', StandardScaler()),
                              ('model', LogisticRegression(solver='liblinear', random_state=42))])
# Same process as above for modeling, evaluation, etc...
3. Or do we put SMOTE() inside the pipeline but fit on the original train data, without using the SMOTE'd data, like this?

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps=[('scale', StandardScaler()),
                              ('over', SMOTE(random_state=42)),
                              ('model', LogisticRegression(solver='liblinear', random_state=42))])
# Same process as above for modeling, evaluation, etc...
# BUT, when fitting, do we call grid.fit() on the original train data like this?:
grid.fit(X_train, y_train)
4. Or do we use the SMOTE'd train data and skip putting SMOTE() inside the Pipeline of sklearn, like this?

X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
pipeline = Pipeline(steps=[('scale', StandardScaler()),
                           ('model', LogisticRegression(solver='liblinear', random_state=42))])
# Same process as above for modeling, evaluation, etc...
# BUT, when fitting, do we call grid.fit() on the SMOTE'd data like this?:
grid.fit(X_train_smote, y_train_smote)
In general, you want to SMOTE the training data but not the validation or test data, so if you want to use folded cross-validation, you cannot SMOTE the data before sending it into that process. That makes your option 3 the correct one: keep SMOTE() as a step in the imblearn pipeline and call grid.fit(X_train, y_train) on the original, un-resampled train data. Because imblearn's Pipeline applies samplers only during fitting, SMOTE is re-fit on the training portion of each fold, while the validation fold (and later the test set) stays untouched.
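Here is a minimal sketch of that setup, reusing your names from above (pipeline, params, and cv are assumed to be defined as in your option 3):

# SMOTE sits inside the imblearn pipeline, so during cross-validation it is
# re-fit and applied to each training fold only; validation folds are never resampled.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
grid = GridSearchCV(estimator=pipeline, param_grid=params,
                    scoring='roc_auc', cv=cv, n_jobs=-1)
# Fit on the ORIGINAL train data; the pipeline handles the oversampling per fold.
grid.fit(X_train, y_train)
# At predict/score time the SMOTE step is skipped automatically, so the test set is untouched.
test_score = grid.score(X_test, y_test)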
I also recommend looking at sklearn.metrics.roc_auc_score(), alongside whatever other metrics you use, because it can reveal issues caused by improperly splitting resampled data. (SMOTE'd points that leak into a validation set can be very easy to predict, but they do not improve the AUC.)
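For example, a quick check on the untouched test set (a sketch that assumes the fitted grid from above and a binary target):

from sklearn.metrics import roc_auc_score

# Score class-1 probabilities on the held-out test set. If this number is far
# below the cross-validation score, resampling was probably leaking into validation.
y_proba = grid.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_proba))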