
Imblearn Pipeline resulting in poor metrics


I am working on an imbalanced dataset, created with the code below:

from sklearn.datasets import make_classification

# 10,000 samples, two features, ~99% of samples in the majority class
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
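
For reference, a quick check of the resulting class balance (a minimal sketch; the exact counts may differ slightly):

from collections import Counter

print(Counter(y))  # roughly Counter({0: 9900, 1: 100}), i.e. about 99:1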

I tried to correct the imbalance with SMOTE oversampling and then fit a model, first directly ("normal method") and then inside a pipeline.

Normal method

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# oversample the minority class up to half the size of the majority class
oversampled_data = SMOTE(sampling_strategy=0.5)
X_over, y_over = oversampled_data.fit_resample(X, y)

logistic = LogisticRegression(solver='liblinear')

scoring = ['accuracy', 'precision', 'recall', 'f1']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluating the model
scores = cross_validate(logistic, X_over, y_over, scoring=scoring, cv=cv,
                        n_jobs=-1, return_train_score=True)

print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}'.format(
    np.mean(scores['test_accuracy']), np.mean(scores['test_precision']),
    np.mean(scores['test_recall']), np.mean(scores['test_f1'])))

Output - Accuracy: 0.93, Precision: 0.92, Recall: 0.86, F1: 0.89

Pipeline

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

oversampled_data = SMOTE(sampling_strategy=0.5)

# imblearn's Pipeline (not sklearn's) is needed so that the sampler's
# fit_resample is applied during fit only, i.e. to the training folds
pipeline = Pipeline([('smote', oversampled_data), ('model', LogisticRegression())])

# pipeline = make_pipeline(oversampled_data, logistic)

scoring = ['accuracy', 'precision', 'recall', 'f1']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluating the model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv,
                        n_jobs=-1, return_train_score=True)

print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}'.format(
    np.mean(scores['test_accuracy']), np.mean(scores['test_precision']),
    np.mean(scores['test_recall']), np.mean(scores['test_f1'])))

Output - Accuracy: 0.96, Precision: 0.19, Recall: 0.84, F1: 0.31

What am I doing wrong when using a Pipeline? Why are the precision and F1 scores so poor in the pipeline version?


Solution

  • In the first approach, you create the synthetic examples before splitting into training and test folds, whereas in the pipeline approach the split happens first and SMOTE is applied to each training fold only.

    The former approach adds synthetic datapoints to the test set, but the latter does not. Moreover, the former produces inflated scores through data leakage: each synthetic test sample is interpolated (in part) from datapoints in the training set. Conversely, the pipeline's precision looks poor because its test folds keep the original 99:1 class distribution, where even a small false-positive rate on the large majority class overwhelms the handful of true positives. The sketch below makes the contrast concrete.
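
    Here is a minimal sketch of the two orderings on a single train/test split instead of cross-validation (variable names are illustrative; X and y are the arrays from the question):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    # Leaky ordering: oversample the full dataset, then split. Synthetic
    # minority points interpolated from (what becomes) test data end up on
    # both sides of the split, and the test set is artificially rebalanced.
    X_over, y_over = SMOTE(sampling_strategy=0.5).fit_resample(X, y)
    Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(
        X_over, y_over, stratify=y_over, random_state=1)
    leaky = LogisticRegression(solver='liblinear').fit(Xa_tr, ya_tr)
    print('leaky precision:   ', precision_score(ya_te, leaky.predict(Xa_te)))

    # Correct ordering: split first, then let the imblearn Pipeline call
    # SMOTE's fit_resample on the training portion only; at predict time the
    # sampler is skipped, so the test set keeps its true 99:1 distribution.
    Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X, y, stratify=y, random_state=1)
    pipe = Pipeline([('smote', SMOTE(sampling_strategy=0.5)),
                     ('model', LogisticRegression(solver='liblinear'))])
    pipe.fit(Xb_tr, yb_tr)
    print('pipeline precision:', precision_score(yb_te, pipe.predict(Xb_te)))

    The first number should come out close to the inflated cross-validated precision above, and the second close to the pipeline's honest figure.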